I have multiple unstructured documents (PDF and HTML files). These documents follow predictable patterns, and there are 'n' distinct patterns.
I need to write a program that extracts information from these documents. Once the program has been trained on a particular pattern, it should automatically pick out the data points from other documents that follow the same pattern.
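To make the requirement concrete, here is a rough sketch of the behaviour I'm after. It is only an illustration, not a proposed solution: it assumes the PDF or HTML has already been converted to plain text, and the function names (`learn_template`, `extract_fields`), the sample invoice text, and the fixed 8-character context window are all made up for this example. It "trains" on one labeled document by remembering the text around each value, then reuses that context on another document of the same pattern.

```python
import re


def learn_template(document, labeled, context=8):
    """Remember the text just before and after each labeled value.

    `labeled` maps a field name to the exact value as it appears in
    `document`. The context width is an arbitrary assumption; if it is
    too wide it may swallow part of a neighbouring value.
    """
    template = {}
    for field, value in labeled.items():
        start = document.index(value)  # raises ValueError if value is absent
        end = start + len(value)
        left = document[max(0, start - context):start]
        right = document[end:end + context]
        template[field] = (left, right)
    return template


def extract_fields(document, template):
    """Apply a learned template to another document of the same pattern."""
    result = {}
    for field, (left, right) in template.items():
        if right:
            pattern = re.escape(left) + r"(.*?)" + re.escape(right)
        else:
            # The value ran to the end of the training document.
            pattern = re.escape(left) + r"(.*)$"
        match = re.search(pattern, document, flags=re.DOTALL)
        result[field] = match.group(1) if match else None
    return result


# Hypothetical sample data: one labeled document, one unseen document.
train_doc = "Invoice No: 1001\nTotal: 250.00 USD\n"
labels = {"invoice_no": "1001", "total": "250.00"}
new_doc = "Invoice No: 2047\nTotal: 99.50 USD\n"

template = learn_template(train_doc, labels)
extracted = extract_fields(new_doc, template)
print(extracted)
```

Exact surrounding text like this is obviously brittle (any change in wording or spacing breaks it), which is why I'm looking for a more robust technique.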
Which technology should I use to write this program? Any help with a specific algorithm would be much appreciated.