I’ve been away for a while… Actually, I didn’t notice the time going by, even though I feel like I’ve been running every day since my last article 😀 .. I was deep in a new dimension of machine learning, one that is huge and has many, many doors to knock on!
-What is this “new dimension” you are talking about?
Okay, let me tell you. If you followed my previous post about OpenCV and Tesseract, you may already know that I’ve been working on a computer vision project involving Optical Character Recognition (OCR) for text in natural scene images. (By the way, I really like the problem I’m trying to solve; that’s why I accepted the project even though I had no previous knowledge of the topics and technologies used in such contexts.)
Back to the subject 🙂 .. After trying out Tesseract and OpenCV, I found that many parameters can affect the result, and that they relate especially to one phase that experienced people emphasize.
–What is that phase?
The preprocessing phase of the images. What I mean by preprocessing is the preparation of the input data (in this case, a natural scene image containing text) in a way that allows the algorithm to deliver good results.
–How is that? Preparing an image? What can you do once you already have the image?…
Yes, preparing it! It’s like when you want to bake a cake: you have to prepare the ingredients.. or when you want to teach your child to tell dogs from cats: you have to collect photos of cats and dogs beforehand, and those will be your material in the process.
And you can prepare an image by playing around with the color intensity of each pixel, regions, contours, edges, alignment.. all of these are image properties. They are what produce the variations present in an image, e.g. blur, noise, contrast, etc. Adjusting these properties to control the variations is necessary in order to detect text regions, segment those regions correctly and, as the last phase in the pipeline, recognize the text.
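To make “playing around with pixel intensity” a bit more concrete, here is a minimal sketch of one such adjustment, contrast stretching, in plain Python. It is only an illustration: the function name stretch_contrast and the tiny sample patch are made up for this example, and a real pipeline would more likely rely on OpenCV or NumPy.

```python
def stretch_contrast(image, low=0, high=255):
    """Linearly rescale pixel intensities so that they span [low, high].

    `image` is a grayscale image stored as a list of rows of integers.
    """
    flat = [p for row in image for p in row]
    darkest, brightest = min(flat), max(flat)
    if brightest == darkest:
        # Completely flat image: nothing to stretch, return a copy.
        return [row[:] for row in image]
    scale = (high - low) / (brightest - darkest)
    return [[round(low + (p - darkest) * scale) for p in row] for row in image]


# Hypothetical low-contrast 3x3 patch (values invented for the example).
patch = [
    [100, 110, 120],
    [105, 115, 125],
    [110, 120, 130],
]
stretched = stretch_contrast(patch)
```

After stretching, the darkest pixel of the patch maps to 0 and the brightest to 255, so faint text stands out more from its background.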
I think that at this point you are starting to understand why it’s a very important phase, aren’t you? Obviously, this phase influences all subsequent phases. When I realized that, I started looking into how to do it. I checked various research papers, OpenCV functions and tutorials. After trying some methods (check my GitHub for the code snippets that I tested) like image thresholding (classifying pixels into two groups: below the threshold or above the threshold), dilation and erosion, contour approximation, and MSER region detection using the method proposed by Neumann, I came to this conclusion: improving text detection and recognition with these methods requires a lot of parameter understanding and tuning, whereas I’m extremely limited by the deadlines imposed by my client!
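For readers who have never met thresholding, dilation and erosion, here is a minimal sketch in plain Python. It is not the code from the GitHub repository mentioned above: the helper names, the 3x3 neighbourhood and the sample image are assumptions made for illustration. In practice, OpenCV offers cv2.threshold, cv2.dilate and cv2.erode for these operations.

```python
def threshold(image, t):
    """Binary threshold: 1 where the pixel is above t, otherwise 0."""
    return [[1 if p > t else 0 for p in row] for row in image]


def _morph(binary, pick):
    """Apply `pick` (max or min) over each pixel's 3x3 neighbourhood.

    Positions that fall outside the image are simply ignored.
    """
    h, w = len(binary), len(binary[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            neighbours = [
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(pick(neighbours))
        out.append(row)
    return out


def dilate(binary):
    """Grow white regions: a pixel becomes 1 if any neighbour is 1."""
    return _morph(binary, max)


def erode(binary):
    """Shrink white regions: a pixel stays 1 only if all neighbours are 1."""
    return _morph(binary, min)


# Hypothetical 5x5 grayscale image: one bright pixel on a dark background.
image = [[50] * 5 for _ in range(5)]
image[2][2] = 200

mask = threshold(image, 128)   # a single white pixel in the centre
grown = dilate(mask)           # the pixel grows into a 3x3 white block
restored = erode(grown)        # eroding the block shrinks it back
```

Even in this toy version there are already two things to tune, the threshold value and the neighbourhood size, which gives a feeling for how quickly the parameters pile up.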
Wow, how funny is that! What should I do! HEEELP! Deadlines, and a newbie in computer vision!… Yeah, I felt frustrated! You have surely faced such a situation at some point in a project.
Me, surprised by how much I have to learn and deal with.
Fortunately, remembering my motivation and goal, I decided to keep calm and take on challenge number 2 on my project path! So I read some other papers, paying more attention to the details of the work presented, and I came to a conclusion!
–Oh, finally we are at the conclusion. What is it? And still, I don’t get what you mean by that “new dimension”!
First, the images I’m dealing with have some specific characteristics:
– The contrast between the text and the background is low,
– The distance between characters varies,
– The number of characters is not the same everywhere,
– And the text isn’t made of words but of a mix of numeric and alphabetical characters, which means it can’t be recognized using a dictionary and a list of choices.
With that in mind, I examined papers with methods based on Convolutional Neural Networks (CNNs) and decided to go for it because, according to research and some applications, CNNs deliver much better results! The method in Text-Attentional Convolutional Neural Networks for Scene Text Detection interested me especially, because it relies on more informative supervised information and on improving contrast in order to detect text.
–mmm.. more informative supervised information? Interesting..How is that ensured?
The presented method involves:
-training a CNN with more informative supervised information, such as text region mask, character label, and binary text/non-text information.
-introducing a deep multi-task learning mechanism to learn the Text-CNN efficiently and making it possible to learn more discriminative text features.
-developing a new type of MSER method by enlarging the local contrast between text and background regions.
(If you know of a paper or method that seems to fit my problem better, let me know 😉 )
And since then, about two weeks ago, I have been learning Deep Learning for computer vision! That new dimension of machine learning! I tried to get familiar with the concepts quickly, so I read these articles, which I recommend if you want to get started with deep learning and computer vision:
– A Quick Introduction to Neural Networks, An Intuitive Explanation of Convolutional Neural Networks, and A Beginner’s Guide to Understanding Convolutional Neural Networks. After that, I installed the CPU version of TensorFlow, as it’s easier to install than the GPU version, along with Keras, because it’s easier (higher level) than TensorFlow, and I followed the Machine Learning Mastery tutorial: Develop Your First Neural Network in Python with Keras Step by Step. There, I discovered something very important! I really need to switch to the GPU version, because training on the CPU is very slow! Buuuuut… to run the GPU version you have to use the graphics card (NVIDIA), which, as I found out later, doesn’t work in a Virtual Machine!!! Oh my God! Another time-consuming thing with a deadline approaching 😀
Apart from that, I haven’t mentioned yet that I have another very important problem to solve, which was the first reason I tried out other methods before switching to CNNs: I don’t have a large amount of data! (You may ask me: how will you make it work with deep learning, then? Everyone knows that data is the most important component of Neural Networks! Without it, you can’t get good results, simply because your CNN won’t learn enough from your input images!)…
It’s possible! There are some methods, some less obvious than others, but it’s possible!
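One commonly used method for small datasets is data augmentation: generating slightly modified copies of the images you already have. The post does not say which methods it has in mind, so the sketch below is only an assumption of what such a step could look like, written in plain Python with made-up helper names (shift, adjust_contrast, add_noise, augment) and made-up parameter ranges.

```python
import random


def shift(image, dx, fill=0):
    """Shift every row dx pixels to the right (left if dx is negative)."""
    w = len(image[0])
    dx = max(-w, min(w, dx))  # keep the shift within the image width
    if dx >= 0:
        return [[fill] * dx + row[:w - dx] for row in image]
    return [row[-dx:] + [fill] * (-dx) for row in image]


def adjust_contrast(image, factor):
    """Scale intensities around mid-gray (128) and clip to [0, 255]."""
    return [[max(0, min(255, round(128 + (p - 128) * factor))) for p in row]
            for row in image]


def add_noise(image, rng, sigma):
    """Add Gaussian noise to every pixel and clip to [0, 255]."""
    return [[max(0, min(255, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]


def augment(image, n, seed=0):
    """Return n randomly perturbed copies of a grayscale image."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    samples = []
    for _ in range(n):
        out = shift(image, rng.randint(-1, 1))
        out = adjust_contrast(out, rng.uniform(0.8, 1.2))
        out = add_noise(out, rng, sigma=5.0)
        samples.append(out)
    return samples


# Hypothetical 2x3 grayscale patch (values invented for the example).
sample = [[10, 20, 30], [40, 50, 60]]
copies = augment(sample, n=4)
```

Flips are deliberately left out of this sketch: a mirrored character is usually no longer the same character, so for text images small shifts, contrast changes and noise seem like safer choices.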
I will explore this 3rd challenge more later on, but for now I’m very happy to have found a complete article about Setting Up Ubuntu + Keras + GPU for Deep Learning, which was published yesterday by PyImageSearch! (Or maybe there is another way… Do you have another way in mind to get a CNN running faster?)
By the way, if you are starting out in Deep Learning with computer vision like me, I would be happy for us to share feedback on what we experiment with along the way, because that will help us move faster ;).
And if you are experienced in the field, let me know what you think about the strategy I’ve taken in dealing with the project and which methods could help.
Looking forward to your comments :))
Hello Sarah,
Thank you for sharing your experience and for all the useful resources and details you have provided.
The way you describe every challenge you have taken on along the project makes it feel like you are pushing everyone to discover that “new dimension” 🙂
I’m also working on a computer vision project, which is specifically about document recognition. It’s the first time I’ve been involved in such a project, but I’m having fun with it!
I’ve also chosen to dig deeper into deep learning. I have prepared an approach to follow in my research project and, following it, I’ve started working on object detection, which is a subfield of computer vision. My goal is to detect the tables that appear in the documents. I’m applying the Faster R-CNN algorithm to detect them (the documents are first converted into images). As a next step, I will work on extracting the text inside the detected tables.
And as you said, working on a deep learning project requires a huge amount of data. I guess you mean techniques like data augmentation and transfer learning, which we can resort to when we don’t have a lot of images. Are these the techniques you meant?
As for hardware requirements, I’m no longer working in a local Jupyter Notebook, as I don’t have enough memory; I switched to Google Colab, a free cloud service based on Jupyter Notebook that offers free GPU access (just take a look here: https://colab.research.google.com/notebooks/welcome.ipynb ). I find it so powerful.
For your project, I suggest you look at this paper ( http://www.di.uniba.it/~malerba/publications/jiis.pdf ), which I found interesting and which touches on several issues, for example how to separate text from non-text areas in a document image.
I wish you the best of luck in your work and more success!