What is PDF Injection? Ways to Exploit, Examples and Impact
Portable Document Format (PDF) files are the universal standard for sharing documents across different platforms while maintaining consistent formatting. From invoices and bank statements to whitepapers and legal contracts, PDFs are everywhere. However, beneath their static appearance lies a complex internal structure that can be manipulated by attackers. PDF Injection is a critical security vulnerability that occurs when an application improperly handles user input while generating or processing PDF files, allowing an attacker to inject malicious content into the document.
In this guide, we will explore the technical nuances of PDF Injection, the various ways it can be exploited—including Server-Side Request Forgery (SSRF) and Cross-Site Scripting (XSS)—and how organizations can defend against these attacks. Whether you are a developer building a document generation service or a security professional performing reconnaissance, understanding PDF Injection is essential for modern web security.
Understanding the PDF File Structure
To understand how injection works, we must first look at how a PDF is built. Unlike a simple text file, a PDF is a collection of objects organized in a specific hierarchy. A typical PDF file consists of four main sections:
- Header: Identifies the version of the PDF specification the file adheres to (e.g., %PDF-1.7).
- Body: Contains the objects that make up the document's content, such as text streams, images, fonts, and interactive elements like forms.
- Cross-Reference (XREF) Table: A lookup table that allows the PDF viewer to locate specific objects within the file quickly.
- Trailer: Points the viewer to the XREF table and the "Root" object (the Catalog), which serves as the starting point for rendering the document.
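As a rough illustration of these four sections, the Python sketch below locates each one in a hand-written skeleton. The SAMPLE bytes and the describe_pdf_sections helper are invented for this example; the xref offsets are placeholders, so this is not a valid PDF, and real files may use compressed cross-reference streams that this simple check would miss.

```python
import re

# Hand-written skeleton for illustration only: the xref offsets are
# placeholders, so a real viewer would reject or repair this file.
SAMPLE = (
    b"%PDF-1.7\n"                                               # header
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"     # body
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"xref\n0 3\n"                                              # xref table
    b"0000000000 65535 f \n"
    b"0000000000 00000 n \n"
    b"0000000000 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"                     # trailer
    b"startxref\n0\n%%EOF\n"
)


def describe_pdf_sections(data: bytes) -> dict:
    """Report which of the classic PDF sections appear in the raw bytes."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    root = re.search(rb"/Root (\d+) (\d+) R", data)
    return {
        "version": header.group(1).decode() if header else None,
        # Count "N G obj" markers that start a line.
        "objects": len(re.findall(rb"(?m)^\d+ \d+ obj\b", data)),
        # A line consisting of "xref" opens a classic cross-reference table.
        "has_xref_table": re.search(rb"(?m)^xref\b", data) is not None,
        # The trailer's /Root entry names the Catalog object.
        "root_object": int(root.group(1)) if root else None,
    }


info = describe_pdf_sections(SAMPLE)
```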
Objects within the body are defined using the syntax [ID] [Generation] obj ... endobj. For example, a page object, which points to its parent, its resources, and its content stream, might look like this:
3 0 obj
<< /Type /Page
/Parent 1 0 R
/Resources 2 0 R
/Contents 4 0 R
>>
endobj
PDF Injection occurs when an attacker can insert their own objects or modify existing ones by breaking out of the intended data fields. This is conceptually similar to SQL Injection, where user input escapes a string literal to execute arbitrary commands.
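The breakout can be sketched in Python for one common case, a PDF literal string delimited by parentheses. The annotation_contents helper and the payload are hypothetical, and the dictionary is a fragment rather than a complete annotation; the point is only that an unescaped closing parenthesis ends the string early and lets the rest of the input be parsed as dictionary entries.

```python
def escape_pdf_literal(text: str) -> str:
    """Escape the characters that delimit or escape a PDF literal string."""
    return (
        text.replace("\\", "\\\\")  # backslash first, so later escapes survive
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def annotation_contents(user_text: str, escape: bool = True) -> str:
    """Build a (fragmentary) annotation dictionary around user-supplied text."""
    value = escape_pdf_literal(user_text) if escape else user_text
    return f"<< /Type /Annot /Subtype /Text /Contents ({value}) >>"


# Hypothetical payload: close the string, add an action, reopen a string
# so that the template's own closing parenthesis still balances.
payload = "note) /A << /S /JavaScript /JS (app.alert(1)) >> /X ("

unsafe = annotation_contents(payload, escape=False)
safe = annotation_contents(payload)
```

In the unescaped version the dictionary gains an /A entry that the template never intended; in the escaped version the whole payload stays inside the /Contents string.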
Types of PDF Injection Attacks
PDF Injection is a broad term that covers several different exploitation techniques. These are generally categorized based on where the injection occurs and what the attacker aims to achieve.
1. HTML-to-PDF Injection (Server-Side)
Many web applications offer a "Download as PDF" feature for receipts, profiles, or reports. To implement this, developers often use libraries like wkhtmltopdf, dompdf, or headless browsers like Puppeteer and Playwright. These tools take HTML/CSS as input and render it into a PDF.
If the application takes user-provided data (like a username or a shipping address) and embeds it directly into the HTML template without proper sanitization, an attacker can inject HTML tags.
Example Payload:
If the application renders: <div>Hello, [USERNAME]</div> into a PDF, an attacker might set their username to:
<img src="x" onerror="document.write('Injected Content')">
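To see why this payload works, the Python sketch below fills the same template twice: once by plain string substitution and once after HTML-encoding the value. The render_unsafe and render_escaped helpers are illustrative names, not part of any PDF library.

```python
from html import escape

TEMPLATE = "<div>Hello, {username}</div>"


def render_unsafe(username: str) -> str:
    """Insert the value as-is: any markup in it becomes part of the page."""
    return TEMPLATE.format(username=username)


def render_escaped(username: str) -> str:
    """HTML-encode the value so the renderer treats it as literal text."""
    return TEMPLATE.format(username=escape(username, quote=True))


payload = '<img src="x" onerror="document.write(\'Injected Content\')">'

unsafe_html = render_unsafe(payload)
escaped_html = render_escaped(payload)
```

The first result contains a live img element that the converter will parse; the second contains only character references, which render as visible text in the PDF.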
2. Server-Side Request Forgery (SSRF) via PDF
This is perhaps the most dangerous form of PDF Injection. When a server-side library renders a PDF from HTML, it often attempts to resolve external resources like images, stylesheets, or scripts. If an attacker can inject an <img> or <iframe> tag, they can force the server to make requests to internal network resources that are not publicly accessible.
Example Payload for AWS Metadata:
<iframe src="http://169.254.169.254/latest/meta-data/iam/security-credentials/" width="500" height="500"></iframe>
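When the server generates the PDF, it fetches the metadata response and embeds it in the document, which is then returned to the attacker. The path above lists the IAM role names attached to the instance; requesting the same path with a role name appended returns that role's temporary credentials. Instances that enforce IMDSv2 reject this kind of plain GET because it carries no session token. The same technique can also be used to scan internal ports or reach internal administrative panels.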
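One way to spot this class of payload before rendering is to check whether a resource URL points at an internal IP literal. The Python sketch below is a partial check under that assumption: the helper name is made up, and it deliberately does not resolve hostnames, so a real filter would also need to resolve names, check every resolved address, and account for redirects.

```python
import ipaddress
from urllib.parse import urlsplit


def is_internal_ip_literal(url: str) -> bool:
    """True if the URL's host is an IP literal in a non-public range."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name, not an IP literal: this sketch does not resolve it.
        return False
    return ip.is_private or ip.is_loopback or ip.is_link_local


metadata_url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```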
3. JavaScript Injection (Client-Side)
The PDF specification supports a JavaScript API (Acrobat JS) to enable interactive features like form validation. An attacker can inject a JavaScript action, which is a dictionary with /S /JavaScript and a /JS entry holding the script, and attach it to an event such as opening the document. When a user opens the malicious PDF in a viewer that supports JavaScript (like Adobe Acrobat or some browser-based viewers), the script executes.
Example PDF Object with JavaScript:
10 0 obj
<< /Type /Action
/S /JavaScript
/JS (app.alert('This PDF is vulnerable to JavaScript injection!');)
>>
endobj
While modern browsers run PDF viewers in a sandbox, this can still be used for phishing, credential harvesting, or triggering vulnerabilities in the PDF reader software itself.
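A first-pass triage for this kind of content can be sketched in Python by searching the raw bytes for the names involved. The list of names and the find_active_content helper are assumptions made for this example; PDF names can be obfuscated with #xx hex escapes and objects can sit inside compressed streams, so an empty result does not prove a file is clean.

```python
import re

# Names commonly associated with scripts or automatic actions.
SUSPICIOUS_NAMES = (b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch")


def find_active_content(data: bytes) -> list:
    """Return the suspicious names that appear as whole tokens in the bytes."""
    found = []
    for name in SUSPICIOUS_NAMES:
        # Require a delimiter or whitespace after the name so that /JS
        # does not match a longer name that merely starts with "/JS".
        pattern = re.escape(name) + rb"(?=[\s/\[\(<>]|$)"
        if re.search(pattern, data):
            found.append(name.decode())
    return found


js_object = (
    b"10 0 obj\n<< /Type /Action\n/S /JavaScript\n"
    b"/JS (app.alert('This PDF is vulnerable to JavaScript injection!');)\n"
    b">>\nendobj\n"
)
hits = find_active_content(js_object)
```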
How to Exploit PDF Injection: Technical Walkthrough
Let’s look at a practical scenario involving a vulnerable PHP application using the dompdf library. The application generates an invoice based on a GET parameter.
Vulnerable Code:
<?php
require_once 'dompdf/autoload.inc.php';
use Dompdf\Dompdf;
$dompdf = new Dompdf();
$user_data = $_GET['name']; // Unsanitized input
$html = "<html><body><h1>Invoice for: " . $user_data . "</h1></body></html>";
$dompdf->loadHtml($html);
$dompdf->render();
$dompdf->stream("invoice.pdf");
?>
Step 1: Testing for HTML Injection
An attacker first tests if basic HTML tags are rendered. They might supply name=<b>Test</b>. If the resulting PDF shows the word "Test" in bold, the application is vulnerable to HTML injection.
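A tester would normally URL-encode the probe before sending it. The Python sketch below builds such a request URL; the endpoint is a placeholder host and build_probe_url is a hypothetical helper, with the parameter name taken from the vulnerable code above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_probe_url(endpoint: str, value: str, param: str = "name") -> str:
    """Append a single URL-encoded query parameter to the endpoint."""
    return f"{endpoint}?{urlencode({param: value})}"


# Placeholder endpoint, not a real service.
probe = build_probe_url("https://app.example.test/invoice.php", "<b>Test</b>")

# Decoding the query string recovers the original markup.
decoded = parse_qs(urlsplit(probe).query)["name"][0]
```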
Step 2: Local File Read
Many PDF libraries allow the use of the file:// URI scheme. The attacker can try to read sensitive files from the server's filesystem.
Payload: name=<iframe src="file:///etc/passwd" width="800px" height="1000px"></iframe>
If the library is configured to allow local file access (some are by default, particularly in older versions), the contents of /etc/passwd will be rendered inside the PDF.
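A defensive counterpart to this step is to reject any resource URL whose scheme is not explicitly allowed. The Python sketch below collects src, href and data attributes from an HTML fragment and reports the ones outside an allowlist; the class and function names are invented for the example, and URLs inside CSS (url(...) or @import) are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}
URL_ATTRS = {"src", "href", "data"}


class ResourceCollector(HTMLParser):
    """Collect the values of URL-bearing attributes from every tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.urls.append(value)


def disallowed_resources(html_fragment: str) -> list:
    """Return resource URLs whose scheme is not in the allowlist.

    Relative URLs have an empty scheme and are reported too, because the
    renderer may resolve them against a file:// base.
    """
    collector = ResourceCollector()
    collector.feed(html_fragment)
    collector.close()
    return [
        url for url in collector.urls
        if urlsplit(url).scheme.lower() not in ALLOWED_SCHEMES
    ]


flagged = disallowed_resources(
    '<iframe src="file:///etc/passwd" width="800px" height="1000px"></iframe>'
)
```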
Step 3: Exfiltrating Data via SSRF
If the server is hosted in a cloud environment, the attacker will target the metadata service.
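Payload: name=<link rel="attachment" href="http://169.254.169.254/latest/meta-data/">
In renderers that support rel="attachment" links (WeasyPrint is one), this attaches the metadata response to the PDF as an embedded file. Other libraries ignore the tag, so the payload has to match the renderer in use.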
The Impact of PDF Injection
The consequences of a successful PDF Injection attack can range from minor annoyances to full system compromise:
- Information Disclosure: Attackers can read local files (like /etc/passwd, .env files, or configuration files) or access internal cloud metadata.
- Internal Network Access: Through SSRF, attackers can interact with internal APIs, databases, or management consoles that are protected by a firewall.
- Phishing and Malware Distribution: Malicious PDFs can be used to trick users into visiting phishing sites or to exploit vulnerabilities in their PDF reader to install malware.
- Denial of Service (DoS): An attacker can inject complex or recursive elements (like a "Billion Laughs" style attack for PDF) that consume excessive CPU or memory during the rendering process, crashing the server.
Real-World Vulnerabilities
Several popular PDF libraries have faced PDF Injection vulnerabilities. For instance, wkhtmltopdf has historically been susceptible to SSRF because it uses a full WebKit engine to render HTML. If not properly sandboxed, it treats the local filesystem and the internal network as reachable resources.
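Similarly, dompdf had a notable vulnerability in which injected CSS could lead to the execution of arbitrary PHP code or the reading of local files. In the code-execution case, an @font-face rule pointing at an attacker-controlled remote font caused a file with a PHP extension to be written to the font cache, where it could then be requested and run. These examples highlight that even when using well-known libraries, the security of the implementation depends heavily on configuration.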
How to Prevent PDF Injection
Securing PDF generation requires a multi-layered approach focusing on input validation and environment hardening.
1. Sanitize and Encode Input
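Treat all user input as untrusted. If a field is meant to hold plain text, HTML-encode it before inserting it into the template (for example with htmlspecialchars in PHP) so that any markup is rendered as literal characters. If users are allowed to supply some formatting, pass the input through a dedicated HTML sanitization library such as DOMPurify (for JavaScript environments) or HTML Purifier (for PHP), and allow only the tags you need rather than trying to strip dangerous ones like <script>, <iframe>, <object>, and <link>.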
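The allowlist idea can be sketched in Python with the standard library's HTML parser. This is a toy with an assumed set of allowed tags, not a substitute for a vetted sanitizer: it drops every attribute, emits only a handful of formatting tags, and encodes all text.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist for this example: simple formatting only.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit allowed tags without attributes and encode all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are always dropped

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(escape(data))


def sanitize(fragment: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


cleaned = sanitize('<b>Hi</b><iframe src="file:///etc/passwd"></iframe>')
```

Tags outside the allowlist disappear entirely, while any text they contained is kept as encoded plain text.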
2. Disable External Resource Loading
Most PDF libraries have settings to disable the loading of remote images, scripts, or stylesheets. If your application only needs to generate text-based PDFs, disable these features entirely.
Example (dompdf configuration):
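use Dompdf\Dompdf;
use Dompdf\Options;
$options = new Options();
$options->set('isRemoteEnabled', false); // do not fetch remote images, stylesheets, or fonts
$options->set('isPhpEnabled', false); // do not execute PHP embedded in the HTML
$dompdf = new Dompdf($options);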
3. Use a Sandbox
Run the PDF generation process in a restricted environment. If you are using Puppeteer, run it inside a Docker container with limited network access. Use tools like AppArmor or Seccomp to restrict the system calls the PDF generator can make.
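Process-level limits complement a container but do not replace it. As a minimal sketch, assuming a Unix host and a converter that is invoked as a separate command, the Python code below runs a child process with a CPU-time cap and a wall-clock timeout. The run_limited helper is hypothetical, and the command shown is a stand-in for a real converter.

```python
import resource
import subprocess
import sys


def run_limited(cmd, cpu_seconds=10, wall_seconds=30):
    """Run cmd with a CPU-time limit and a wall-clock timeout (Unix only)."""

    def apply_limits():
        # Runs in the child just before exec. Lower only the soft limit and
        # never exceed the existing hard limit, so the call cannot fail
        # for lack of privileges.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        soft = cpu_seconds
        if hard != resource.RLIM_INFINITY:
            soft = min(cpu_seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    return subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=wall_seconds,  # raises TimeoutExpired if the child hangs
        preexec_fn=apply_limits,
        check=False,
    )


# Stand-in for a converter command line.
result = run_limited([sys.executable, "-c", "print('rendered')"])
```

A runaway render is then killed by the kernel or the timeout instead of tying up the server, which also limits the denial-of-service impact described earlier.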
4. Network Isolation
Ensure the server responsible for PDF generation cannot access sensitive internal network segments. It should not be able to reach the cloud metadata service (169.254.169.254) or internal databases unless absolutely necessary.
5. Content Security Policy (CSP)
If you are using a browser-based generator, implement a strict CSP that prevents the execution of inline scripts and restricts the domains from which resources can be fetched.
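One way to attach such a policy is a meta tag in the template itself. The Python sketch below wraps an already-sanitized body in a page with a restrictive policy; the policy string and the wrap_with_csp helper are assumptions for illustration. Chromium-based renderers honor a CSP delivered this way, but other converters may ignore it, so it is worth verifying against the renderer in use.

```python
# Assumed policy: block everything by default, then allow inline styles
# and data: images so that a self-contained template still renders.
CSP_POLICY = "default-src 'none'; img-src data:; style-src 'unsafe-inline'"


def wrap_with_csp(body_html: str, policy: str = CSP_POLICY) -> str:
    """Wrap a sanitized HTML body in a document that declares a CSP."""
    return (
        "<!DOCTYPE html><html><head>"
        f'<meta http-equiv="Content-Security-Policy" content="{policy}">'
        "</head>"
        f"<body>{body_html}</body></html>"
    )


page = wrap_with_csp("<p>Invoice for: Alice</p>")
```

The meta tag is placed in the head because a policy only applies to content parsed after it.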
Conclusion
PDF Injection is a versatile and dangerous attack vector that leverages the complexity of the PDF format and the power of document generation libraries. By understanding how these tools interpret HTML and JavaScript, attackers can turn a simple "Export to PDF" button into a gateway for data theft and network infiltration.
For developers, the key takeaway is that document generation is not just a UI feature; it is a security boundary. Proper sanitization, disabling unnecessary features, and network-level isolation are the primary defenses against these exploits. As infrastructure grows more complex, staying ahead of these vulnerabilities is critical for maintaining a robust security posture.