I've had success reading and writing PowerPoint presentations with Apache POI on GAE. The important thing is to avoid POI calls that invoke the security-restricted java.awt classes. Reading content from a document does not touch java.awt, so you should be fine there. Writing content is where you have to be careful: I use a predefined template and adjust the text and fonts directly, which avoids java.awt calls. If you try to create a new PPT document using an existing document as a template (as shown in the POI examples), it will fail because GAE prohibits the java.awt calls involved. Your mileage may vary with Word docs, as I imagine they involve fewer graphics library calls.

You'll probably struggle with newer document formats like Word 2010, and you'll obviously have to use URL Fetch, Google Cloud Storage, or the Blobstore to work with the files, since GAE doesn't support native file access.
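Since there is no local file system to write to, one way to picture the approach is to serialise the document into a byte array and hand those bytes to whichever storage API you use. The sketch below is only an illustration: `InMemoryExport` and `StreamWriter` are hypothetical names, and the writer argument stands in for a call such as POI's `workbook.write(out)`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: captures anything that writes to a stream as a byte array. */
final class InMemoryExport {

    /** Stand-in for an object that can serialise itself, e.g. out -> workbook.write(out). */
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Runs the writer against an in-memory buffer instead of a file. */
    static byte[] toBytes(StreamWriter writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writer.writeTo(buffer);
        } catch (IOException e) {
            // An in-memory buffer should not fail, but the writer itself might.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

With POI on the classpath, something like `InMemoryExport.toBytes(out -> workbook.write(out))` should slot in, and the resulting bytes can then be passed to Cloud Storage or the Blobstore.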

Does google app engine support apache poi? - Stack Overflow

google-app-engine

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.15</version>
</dependency>

The HSSFWorkbook and HSSFSheet classes used in the code snippets here work with the .xls format. To work with .xlsx, use the XSSFWorkbook and XSSFSheet classes instead; these ship in the separate poi-ooxml artifact rather than in poi.
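If you don't know in advance which of the two formats you have been handed, the first few bytes of the file usually tell you: OLE2 files (.xls) start with a fixed 8-byte signature, while .xlsx files are ZIP archives and start with "PK". Here is a rough sketch using only the standard library. `OfficeFormatSniffer` is a hypothetical helper, not part of POI, and POI ships its own detection (for example in `WorkbookFactory`), so treat this purely as an illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: guesses the container format from a file's leading bytes. */
final class OfficeFormatSniffer {

    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // OLE2 compound documents (.xls, .doc, .ppt) start with this signature.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP archives (.xlsx, .docx, .pptx, but also any other ZIP) start with "PK\003\004".
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a buffer holding the first bytes of a file. */
    static Container sniff(byte[] firstBytes) {
        if (startsWith(firstBytes, OLE2_MAGIC)) return Container.OLE2;      // use HSSF
        if (startsWith(firstBytes, ZIP_MAGIC)) return Container.ZIP_BASED;  // use XSSF
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream and classifies them. Consumes those bytes. */
    static Container sniff(InputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) break; // stream ended early
            read += n;
        }
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

Note that the stream variant consumes the header bytes, so wrap the input in a `BufferedInputStream` and use `mark`/`reset` (or reopen it) before handing it to HSSFWorkbook or XSSFWorkbook.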

HSSF (Horrible SpreadSheet Format) reads and writes Microsoft Excel (XLS) format files.

XSSF (XML SpreadSheet Format) reads and writes Office Open XML (XLSX) format files.

HWPF (Horrible Word Processor Format) aims to read and write Microsoft Word 97 (DOC) format files.

HSLF (Horrible Slide Layout Format) is a pure Java implementation for Microsoft PowerPoint files.

HPBF (Horrible PuBlisher Format) is a pure Java implementation for Microsoft Publisher files.

HSMF (Horrible Stupid Mail Format) is a pure Java implementation for Microsoft Outlook MSG files.

DDF (Dreadful Drawing Format) is a package for decoding the Microsoft Office Drawing format.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
//..
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("FuSsA sheet");
// Create a new row in the current sheet
Row row = sheet.createRow(0);
// Create a new cell in the current row
Cell cell = row.createCell(0);
// Set the value of the new cell
cell.setCellValue("Slim Shady");
// try-with-resources closes the stream even if write() throws
try (FileOutputStream out = new FileOutputStream(new File("C:\\new.xls"))) {
    workbook.write(out);
    System.out.println("Excel written successfully..");
} catch (IOException e) {
    // Also covers FileNotFoundException, which is a subclass of IOException
    e.printStackTrace();
}
// Update an existing workbook in place (needs java.io.FileInputStream as well)
try {
    HSSFWorkbook workbook;
    // Load the whole workbook into memory, then release the input file
    try (FileInputStream file = new FileInputStream(new File("C:\\update.xls"))) {
        workbook = new HSSFWorkbook(file);
    }
    HSSFSheet sheet = workbook.getSheetAt(0);

    // Double the numeric value in the third column (index 2) of rows 1 to 3
    for (int rowIndex = 1; rowIndex <= 3; rowIndex++) {
        Cell cell = sheet.getRow(rowIndex).getCell(2);
        cell.setCellValue(cell.getNumericCellValue() * 2);
    }

    // Write the modified workbook back over the same file
    try (FileOutputStream outFile = new FileOutputStream(new File("C:\\update.xls"))) {
        workbook.write(outFile);
    }
} catch (IOException e) {
    // Also covers FileNotFoundException
    e.printStackTrace();
}

For more details, including adding formulas and styles to cells, see: Read / Write Excel file in Java using Apache POI

java - Read / Write different Microsoft Office file formats using Apac...

java maven apache-poi xlsx

It was mentioned only briefly once, so I'd like to call out the docx4j library, as I've had more success with docx4j than anything else. Apache POI's support for Word documents isn't very good. Also, unlike Aspose.Words, docx4j is an open source library.

The only drawback is that with docx4j you have to create Office Open XML (docx) documents rather than OLE2-based (doc) documents. Docx is the default format as of Word 2007, but users of Word 2003 and earlier will need to install a compatibility pack.

What's a good Java API for creating Word documents? - Stack Overflow

java ms-word docx doc

Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).

Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.

Get text from doc/docx file in pages using Apache tika - Stack Overflo...

apache-tika

Here are my two cents on converting between Word 2007 docx, Word 97-2004 doc, PDF, and all the other MS Office formats that are supposed to be "converted from y to z" but in reality don't want to be. In my experience so far, conversion with LibreOffice or OpenOffice can't be relied on, though .doc documents tend to be better supported than Word 2007's .docx. In general, it's very hard to convert .docx to .doc without breaking anything.

Conversion from .doc to PDF was quite reliable most of the time. If you can still influence the design or content of the Word document, this might be satisfactory, but in my situation the documents were supplied by foreign companies, and even after generating the .docx templates, in some scenarios the generated .docx had to be slightly modified with supplementary text before it was converted to PDF.

All these hiccups led me to conclude that the only truly reliable conversion method I found was to use the COM class in PHP and let the MS Word or Excel application do all the work. I'll just give an example of converting .docx to .doc and/or PDF. If you do not have MS Office installed, you can download a 60-day trial version, which gives you enough room for testing.

The COM/.NET extension is commented out by default in php.ini; just search for the line containing php_com_dotnet.dll and uncomment it, like so:

extension=php_com_dotnet.dll

Restart the web server (IIS is not a prerequisite; Apache will work just as well).

The code below demonstrates how easy it is.

$word = new COM("Word.Application") or die ("Could not initialise Object.");
// Set to 1 to see the MS Word window (the actual opening of the document)
$word->Visible = 0;
// Recommended: 0 disables alerts like "Do you want MS Word to be the default .. etc"
$word->DisplayAlerts = 0;
// Open the Word 2007-2013 document
$word->Documents->Open('yourdocument.docx');
// Save it as Word 2003
$word->ActiveDocument->SaveAs('newdocument.doc');
// Convert Word 2007-2013 to PDF (17 is the wdExportFormatPDF constant)
$word->ActiveDocument->ExportAsFixedFormat('yourdocument.pdf', 17, false, 0, 0, 0, 0, 7, true, true, 2, true, true, false);
// Quit the Word process without saving changes
$word->Quit(false);
// Clean up
unset($word);

This is just a small demonstration. All I can say is that when it comes to conversion, this was the only really reliable option I could use and would even recommend.

This works well for me on local XAMPP with the php_com_dotnet.dll extension enabled. But when I go live, I suspect I will need the COM extension on the Windows hosting, so this may be a problem with shared hosting.

@shasi Shared hosting environments are not meant for solutions like this and rarely fit any solution that requires custom standardisation. You need not only the COM extension in PHP, but also a Word executable on the same host.

Convert Word doc, docx and Excel xls, xlsx to PDF with PHP - Stack Ove...

php excel ms-word pdf-generation

I remember using Apache Lucene some time ago to search inside different types of documents from Java, among them PDF and Word files.

However, this question entirely depends on the programming language you're using, so if you're not using Java you might want to specify it.

Unable to convert a pdf file to a pdf file which I can search - Stack ...

pdf

Thanks for your input! Do you know of any implementation where Apache POI was used to actually create a word document (not only parse it)?

Sorry, I don't know much about it other than it exists.

docx4j (my project) is focused on doing stuff with docx files (as opposed to xlsx, though it handles those as well)

Creating Microsoft Word (.docx) documents in Ruby - Stack Overflow

ruby-on-rails ruby ms-word documents

One is to use Apache Tika. Tika is a text and metadata extraction toolkit, and it can extract fairly rich text from Word documents by making the appropriate calls to POI. The result is that Tika will give you XHTML-style XML for the contents of your Word document.

The other option is to use a class that was added fairly recently to POI, WordToHtmlConverter. This will turn your Word document into HTML for you, and generally preserves slightly more of the structure and formatting than Tika does.

Depending on the kind of XML you're hoping to get out, one of these should be a good bet for you. I'd suggest you try both against some of your sample files, and see which one is the best fit for your problem domain and needs.

Which jar file is the WordToHtmlConverter class in? I think it is still in the early stages of development and not yet released as a jar file?

It's in the Scratchpad jar file. You'll want to get the latest beta, 3.8 Beta 4, and use the main POI jar + scratchpad jar from that.

java - Is it possible to parse MS Word using Apache POI and convert it...

java ms-word apache-poi

Apache POI can handle Word documents via its HWPF module, and extract or insert images from these. Although it's not well documented, check out the POI unit tests for image manipulation within Word (the unit tests seem to be the best documentation for this module).

Failing that, the COM interface is accessible via (say) JACOB. That's probably more work, but will make available APIs not exposed via POI.

java - How to programmaticaly extract and manipulate images from an Of...

java ms-word ms-office powerpoint

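// run is an existing XWPFRun; CTR is its underlying XML bean
// (org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR)
CTR ctr = run.getCTR();
// Append a tab element to the run, then the text that follows the tab
ctr.addNewTab();
run.setText("abcd");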

java - Creating tab spaces in word document using Apache POI - Stack O...

java ms-word apache-poi

Add this line at the end of your loop:

java - Create "Multiple Tables" in ms word document using apache poi -...

java ms-word apache-poi

This blog has a good, simple tutorial for writing Word documents in Apache POI. It also has a tutorial for working with tables in a later post.

java - Apache POI Word tutorial. - Stack Overflow

java apache apache-poi

However, just like printing, you will have to sacrifice some graphics, so this would be best used for exporting content (but as you did not specify, I'm just clarifying that this is possible but at a cost).

apache poi - ColdFusion - converting HTML webpage to Word or PDF docum...

coldfusion apache-poi cfdocument

I think it is possible with POI. You can find and replace the relevant line using POI. I haven't tried editing documents myself.

Yeah, that's the thing... they mention it, but there's really little about it. Don't really know how to even go about it :|

Is HWPFDocument not part of apache poi 3.10? It doesn't seem to be able to find the imports.

Sweet, works like a charm. BUT! Microsoft Word is complaining that the file may be insecure to open, and won't let me edit it :| Any idea what's going on?

java - apache poi Word documents (.doc, .docx) updating - Stack Overfl...

java ms-word apache-poi

Apache POI XWPF can do it, but it is currently dead. The following Java APIs are also available for handling OpenXML MS Word documents:

There was one more, but I don't recall the name anymore.

As to your functional requirement: merging two documents so that the result matches what the end user would expect is technically tricky. Most APIs won't do it for you. You'll need to extract the desired information from the two documents and then create one new document from that information yourself.

How do you decide which to use? I'm torn between Apache POI and OpenOffice.org. The second would require installing OpenOffice, which I think would hurt performance. Is that true?

I guess the best way to decide which to use is to try them with your documents. You can try a commercial tool based on docx4j, at webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml

Is there any java library (maybe poi?) which allows to merge docx file...

java docx

Use Apache POI's SummaryInformation to fetch the total page count of an MS Word document.
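SummaryInformation covers the OLE2 (.doc) property set. For .docx files, the equivalent stored value normally sits in the `docProps/app.xml` part of the ZIP package as a `<Pages>` element, and here is a minimal sketch that reads it with the standard library only. `DocxPageCount` is a hypothetical helper name, and the regular expression is a shortcut in place of a real XML parser. The value is whatever the application last saved, not a freshly computed layout, so it can be missing or stale.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: reads the page count stored in a .docx file's metadata. */
final class DocxPageCount {

    // Matches <Pages>12</Pages>, with an optional namespace prefix on the element.
    private static final Pattern PAGES = Pattern.compile("<(?:\\w+:)?Pages>\\s*(\\d+)\\s*</");

    /** Extracts the stored page count from the text of docProps/app.xml, if present. */
    static OptionalInt parsePageCount(String appXml) {
        Matcher m = PAGES.matcher(appXml);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    /** Scans the ZIP package for docProps/app.xml and parses the page count from it. */
    static OptionalInt storedPageCount(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"docProps/app.xml".equals(entry.getName())) continue;
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return parsePageCount(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return OptionalInt.empty(); // no app.xml part in this package
    }
}
```

Calling `DocxPageCount.storedPageCount(new FileInputStream("report.docx"))` (the file name is just a placeholder) returns an empty `OptionalInt` when the part or the element is absent.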

Number of pages in a word doc in java - Stack Overflow

java ms-word

Apache POI - the Java API for Microsoft Documents

The Apache POI Project's mission is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java.

apache - Is there any code to read MS Office PPT on Android? - Stack O...

android apache ms-office apache-poi

I have successfully used Apache FOP to convert a 'WordML' document to PDF. WordML is the Office 2003 way of saving a Word document as XML. XSLT stylesheets can be found on the web to transform this XML to XSL-FO, which in turn can be rendered by FOP into PDF (among other outputs).

It's not so different from the solution plutext offered, except that it doesn't read a .doc document, whereas docx4j apparently does. If your requirements are flexible enough to have WordML style documents as input, this might be worth looking into.

Good luck with your project! Wim

Creating PDF from Word (DOC) using Apache POI and iText in JAVA - Stac...

java ms-word pdf-generation itext apache-poi