Document conversion with Collabora Online, JODConverter and unoconv

JODConverter (Java OpenDocument Converter) is a widely used tool that automates document conversions; unoconv is a Python tool with a similar purpose. This article explains why you should consider switching to JODConverter’s Collabora Online backend, or talking to Collabora Online directly.

These tools support OpenDocument, PDF, HTML, Microsoft Office formats (DOC/DOCX/RTF, XLS/XLSX, PPT/PPTX) and many others. They can be used as a Java or Python library, a command-line tool, or a web application. Newer versions of JODConverter include a backend that uses Collabora Online instead of driving LibreOffice directly.

What are the benefits of using Collabora Online for document conversion?

  • Improved performance compared to the startup-convert-shutdown approach
  • The REST API is more reliable than starting LibreOffice in server mode and communicating via remote UNO
  • More secure: the conversion happens in an isolated environment, and a layered approach protects your infrastructure (from outer to inner layers):
    • Easy to run in a virtual machine or Docker container
    • Document data isolation into per-document chroots
    • Seccomp-bpf: inside that chroot (almost) no system calls are allowed
    • Extremely sparse filesystem inside the chroot: no shell etc.

Benefit                          JODConverter  unoconv  Collabora Online
Many file formats                Yes           Yes      Yes
Single startup cost              No            No       Yes
Standard REST API                No            No       Yes
Easy isolation into VM / Docker  No            No       Yes
Document isolation               No            No       Yes
Syscall filter                   No            No       Yes
Sparse filesystem                No            No       Yes

This means you get both improved performance and better security when converting documents with Collabora Online.

Performance

The chart shows how Collabora Online performs compared to JODConverter’s LibreOffice backend and unoconv as the number of threads grows, measured in documents converted per second:


You can see that Collabora Online not only starts with better performance, but also scales better as you use more threads. (We compare curl invocations for Collabora Online with Java command-line invocations of JODConverter and Python command-line invocations of unoconv.)
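A measurement of this kind (documents converted per second at a given thread count) can be approximated with a small harness. The sketch below is not the benchmark behind the chart: measure_throughput and the convert callable are hypothetical names, and you would plug in whatever performs one conversion in your setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(convert, documents, threads):
    """Run convert(doc) for every document on a thread pool.

    `convert` is any callable performing one conversion (an assumption of
    this sketch). Returns the number of documents converted per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for every conversion and re-raises the first error.
        results = list(pool.map(convert, documents))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed if elapsed > 0 else float("inf")
```

Calling it with increasing values of threads on the same document set gives one data point per thread count.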

Building

If you want to try out JODConverter with its Collabora Online backend:

git clone https://github.com/sbraconnier/jodconverter
cd jodconverter
sh gradlew build -x integTest distZip
cd build/distributions
unzip jodconverter-cli-*.zip
cd jodconverter-cli-*/

Running

  • Example:
    bin/jodconverter-cli -c https://localhost:9980/ -f pdf README.txt
  • The input format is detected automatically; -f determines the output format.
  • The URL passed with -c is your Collabora Online server URL, that is, the https:// value from the installation guide.
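If you drive the command-line tool from a script, the invocation above can be wrapped in a few lines. This is a minimal sketch that only uses the -c and -f options shown in the example. The function names and the default path bin/jodconverter-cli (relative to the unpacked distribution) are assumptions.

```python
import subprocess


def jodconverter_command(server_url, output_format, input_file,
                         cli="bin/jodconverter-cli"):
    """Build the argument list for the CLI call shown in the example."""
    return [cli, "-c", server_url, "-f", output_format, input_file]


def convert_with_cli(server_url, output_format, input_file,
                     cli="bin/jodconverter-cli"):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    command = jodconverter_command(server_url, output_format, input_file, cli)
    subprocess.run(command, check=True)

# Example (needs a running server and the unpacked distribution):
# convert_with_cli("https://localhost:9980/", "pdf", "README.txt")
```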

Using the Collabora Online REST API directly

  • If you are not already using JODConverter, you can call the REST API directly, for example:
    curl -F "data=@test.txt" https://localhost:9980/cool/convert-to/pdf > out.pdf
    curl -F "data=@test.txt" https://localhost:9980/cool/convert-to/png > out.png
  • Alternatively, you can pass the output format as a form field, for example:
    curl -F "data=@test.txt" -F "format=pdf" https://localhost:9980/cool/convert-to > out.pdf
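The same request can be sent from Python without third-party packages. The sketch below mirrors the first curl call: a multipart/form-data POST with a single field named data, sent to /cool/convert-to/<format>. The function names are hypothetical. With a self-signed certificate, urlopen will reject the connection unless the certificate is trusted.

```python
import uuid
import urllib.request
from pathlib import Path


def build_convert_request(server_url, file_name, file_bytes, output_format):
    """Build a multipart/form-data POST for the convert-to endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="data"; '
         f'filename="{file_name}"\r\n').encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        file_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{server_url.rstrip('/')}/cool/convert-to/{output_format}"
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def convert_file(server_url, input_path, output_path, output_format, timeout=60):
    """Upload input_path and write the converted bytes to output_path."""
    source = Path(input_path)
    request = build_convert_request(
        server_url, source.name, source.read_bytes(), output_format)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        Path(output_path).write_bytes(response.read())

# Example (needs a running server):
# convert_file("https://localhost:9980", "test.txt", "out.pdf", "pdf")
```

Keeping request construction separate from sending makes it possible to inspect the request without a server.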

Supported formats

Supported input formats:

Documents                         Input formats
Writer documents                  sxw (view), odt and fodt (edit)
Calc documents                    sxc (view), ods and fods (edit)
Impress documents                 sxi (view), odp and fodp (edit)
Draw documents                    sxd (view), odg and fodg (edit)
Chart documents                   odc (edit)
Text master documents             sxg (view), odm (edit)
Text template documents           stw (view), ott (edit)
Writer master document templates  otm (edit)
Spreadsheet template documents    stc (view), ots (edit)
Presentation template documents   sti (view), otp (edit)
Drawing template documents        std (view), otg (edit)
Base documents                    odb (edit)
Extensions                        oxt (edit)
MS Word                           doc and dot (edit)
MS Excel                          xls (edit)
MS PowerPoint                     ppt (edit)
OOXML wordprocessing              docx and docm (edit), dotx and dotm (view)
OOXML spreadsheet                 xltx and xltm (view), xlsx and xlsb and xlsm (edit)
OOXML presentation                pptx, pptm, potx, potm (edit)
Other                             wpd, pdb, hwp, wps, wri, wk1, cgm, dxf, emf, wmf, cdr, vsd, pub, vss, lrf, gnumeric, mw, numbers, p65, pdf, jpg, jpeg, gif, png, etc (view)
Other                             dif, slk, csv, dbf, oth, rtf, txt, etc (edit)

Supported output formats for Writer/Calc/Impress/Draw (file extension, followed by the export filter name):

Documents  Output formats
Writer     doc for MS Word 97, docm for MS Word 2007 XML VBA, docx for MS Word 2007 XML, fodt for OpenDocument Text Flat XML, html for HTML (StarWriter), odt for writer8, ott for writer8_template, pdf for writer_pdf_Export, rtf for Rich Text Format, txt for Text, xhtml for XHTML Writer File, png for writer_png_Export
Calc       csv for Text - txt - csv (StarCalc), fods for OpenDocument Spreadsheet Flat XML, html for HTML (StarCalc), ods for calc8, ots for calc8_template, pdf for calc_pdf_Export, xhtml for XHTML Calc File, xls for MS Excel 97, xlsm for Calc MS Excel 2007 VBA XML, xlsx for Calc MS Excel 2007 XML, png for calc_png_Export
Impress    fodp for OpenDocument Presentation Flat XML, html for impress_html_Export, odg for impress8_draw, odp for impress8, otp for impress8_template, pdf for impress_pdf_Export, potm for Impress MS PowerPoint 2007 XML Template, pot for MS PowerPoint 97 Vorlage, pptm for Impress MS PowerPoint 2007 XML VBA, pptx for Impress MS PowerPoint 2007 XML, pps for MS PowerPoint 97 Autoplay, ppt for MS PowerPoint 97, svg for impress_svg_Export, swf for impress_flash_Export, xhtml for XHTML Impress File, png for impress_png_Export
Draw       fodg for draw_ODG_FlatXML, html for draw_html_Export, odg for draw8, pdf for draw_pdf_Export, svg for draw_svg_Export, swf for draw_flash_Export, xhtml for XHTML Draw File, png for draw_png_Export
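A client can check the requested output format against this list before sending a document, so an unsupported combination fails early. The sketch below copies the extensions from the output table. The dictionary and function names are hypothetical, and your server version may support a different set.

```python
# Output extensions per document type, copied from the table above.
OUTPUT_FORMATS = {
    "writer": {"doc", "docm", "docx", "fodt", "html", "odt", "ott", "pdf",
               "rtf", "txt", "xhtml", "png"},
    "calc": {"csv", "fods", "html", "ods", "ots", "pdf", "xhtml", "xls",
             "xlsm", "xlsx", "png"},
    "impress": {"fodp", "html", "odg", "odp", "otp", "pdf", "potm", "pot",
                "pptm", "pptx", "pps", "ppt", "svg", "swf", "xhtml", "png"},
    "draw": {"fodg", "html", "odg", "pdf", "svg", "swf", "xhtml", "png"},
}


def is_supported_output(document_type, extension):
    """Return True if the table lists `extension` for `document_type`."""
    formats = OUTPUT_FORMATS.get(document_type.lower(), set())
    # Accept "PDF" and ".pdf" as well as "pdf".
    return extension.lower().lstrip(".") in formats
```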

Trusting the local Collabora Online HTTPS certificate from Java

This is only needed if your Collabora Online installation uses a self-signed certificate.

  • Get the certificate:
    openssl s_client -connect localhost:9980 </dev/null 2>&1 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p'
  • Paste it into a file named certfile.txt.
  • Import it into the Java key store (the password is changeit by default; on Java 9 and later the key store is at $JAVA_HOME/lib/security/cacerts):
    keytool -importcert -keystore $JAVA_HOME/jre/lib/security/cacerts -alias mycert -file certfile.txt

Depending on the value of $JAVA_HOME, you may need to run keytool with root/Administrator privileges.
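If openssl is not at hand, the certificate can also be fetched with Python's standard ssl module, which returns it in the PEM form that keytool accepts. This is a sketch of an alternative to the first two steps, not part of the original procedure, and save_server_certificate is a hypothetical helper name.

```python
import ssl
from pathlib import Path


def save_server_certificate(host, port, path="certfile.txt",
                            fetch=ssl.get_server_certificate):
    """Fetch the server certificate as PEM text and write it to `path`.

    `fetch` defaults to ssl.get_server_certificate, which does not verify
    the certificate, so it also works for self-signed ones.
    """
    pem = fetch((host, port))
    Path(path).write_text(pem)
    return pem

# Example (needs a running server):
# save_server_certificate("localhost", 9980)
```

The resulting certfile.txt can then be imported with the keytool command above.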

Conclusions

  • Already using JODConverter? Consider switching to its safer Collabora Online backend.
  • Using Collabora Online via JODConverter or unoconv? Consider switching to our simple REST conversion API (Java sample code, Python sample code).
  • Using another tool? Evaluate whether a standard Collabora Online solution meets your performance and conversion needs.

Need support and help integrating document conversion into your product? Feel free to send us an email:

Contact us!