XML External Entities, Attack and Defence

XML is used widely in many different areas of computing. It's been wildly successful, especially compared to its more complex sibling SGML. Most people think of XML as just a bunch of tags and some text, which is normally a perfectly reasonable way to regard it. Unfortunately, when you're working with XML data that originates from an untrusted source, there are some gotchas waiting to bite you.

A facility of XML that is often overlooked is the ability of a document to refer to external resources. These resources are loaded by the parser when it parses the file. The facility is generally used to split large documents, such as books written using DocBook, into manageable chunks like chapters rather than keeping them as one big file. Unfortunately, it is also a common source of security problems.

First, let's make a simple XML document that uses an external entity so we can see how it works. Our XML document is as follows:

<?xml version="1.0"?>
<!DOCTYPE test [ <!ENTITY test SYSTEM "hello-world.txt">]>
<root>
&test;
</root>
  

This document defines an external entity called test whose content will be loaded from the file hello-world.txt. We'll parse this XML file using the following Python script. It just parses the file, then dumps the results back out again.

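#!/usr/bin/env python3

import sys
from lxml import etree

def load_doc(filename):
    # Binary mode lets the parser work out the encoding itself.
    with open(filename, 'rb') as f:
        # lxml 5.0 and later resolve only internal entities by default,
        # so external resolution is requested explicitly for this demo.
        parser = etree.XMLParser(resolve_entities=True)
        return etree.parse(f, parser)

def dump_doc(doc):
    # Serialise to str rather than bytes so it prints cleanly.
    return etree.tostring(doc, encoding='unicode')

if __name__ == '__main__':
    doc = load_doc(sys.argv[1])
    print(dump_doc(doc))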
  

If we run the script over our XML file then we get the expected results - the hello-world.txt file has been loaded, and its content is in the resulting output.

<root>
Hello World

</root>
  

Now, this is all fine when we can trust the XML we're parsing, but what if we can't? There are many online services and protocols that use XML, and they can't count on the documents they're parsing to play nicely. An attacker can create a malicious document that asks the parser to include sensitive data, and in many cases this data will then be visible to them. It is worth restating that the data is included when the document is parsed, so if your server-side code is parsing the XML then that's where the inclusion happens. An example of a malicious entity might be one that reads the password file of a UNIX system (giving the attacker a list of account names to brute force against the SSH server):

<?xml version="1.0"?>
<!DOCTYPE test [ <!ENTITY test SYSTEM "/etc/passwd">]>
<root>
&test;
</root>
  

Of course, XML was designed during the Internet age, so it doesn't really think in terms of files, it thinks in terms of URLs... Uh oh. Yes, external entities can reference data anywhere on the Internet. To show this, I'll use xmllint, a tool for parsing XML and checking it for errors that's built on the widely used libxml2 library. I'll pass it the --noent option to tell it to resolve entities, but bear in mind that many parsers will do this by default.

xmllint --noent --format use-entity-url.xml
  

Our entity definition for this example is below, and xmllint will quite happily go off and download the text file. As you can see from the name of the server, this URL wouldn't normally be accessible to the attacker.

<!ENTITY test SYSTEM "http://internalserver/mytext.txt">
  

Again, it's important to remember where this is happening: it's all being done by the server that is processing the XML. Unfortunately, that server is often going to be behind your firewall. This means that, unless you've got defence in depth, parsing this document can reach into your internal network! An attacker can use these attacks to access internal servers that are behind your firewall and to port scan your internal network.

Port scanning the internal network is accomplished by making lots of requests and comparing the results. Commonly you can determine from changes in the error messages or from timing differences whether the address and port are valid. Of course, the attacker has to make some guesses about the addressing scheme you use internally, but this isn't much of a barrier.

<!ENTITY test SYSTEM "http://192.168.0.1:8080/">
  

The attacker can even access the loopback interface to make your server pull in data from other ports. This can allow access to services that only listen locally and would normally be inaccessible from the network.

<!ENTITY test SYSTEM "http://127.0.0.1:1234/">
  

Finally, the attacker could trick your server into launching attacks against other sites on the Internet. The attack will be coming from your IP address, so you'd get the blame!

<!ENTITY test SYSTEM "http://www.example.com/private/somedata.txt">
  

External entity attacks aren't new by any means. Westpoint found flaws of this kind during penetration tests performed years ago, and we weren't the first to do so. Unfortunately the problem is still happening today: Facebook paid a bounty to someone who found one of these flaws in their OpenID implementation just last month.

So, what can we do about this problem? Well, the answer is that we need to either disable external entities entirely or at least gain some control over what they're allowed to access. Fortunately most modern XML libraries offer this control and even for older ones there are workarounds.
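To make "some control over what they're allowed to access" a little more concrete, here is a minimal sketch of an allowlist check for system identifiers. It uses only the Python standard library, and the names check_system_id and EntityNotAllowed are hypothetical helpers invented for this illustration. The policy it assumes is one possible choice, not the only sensible one: entities may only be plain file paths that stay inside a single directory, which would still allow the DocBook-style chapter splitting mentioned earlier.

```python
import os
from urllib.parse import urlparse


class EntityNotAllowed(Exception):
    """Raised when a system identifier falls outside the allowed directory."""


def check_system_id(system_id, allowed_dir):
    """Return the real path for system_id if it stays inside allowed_dir."""
    parsed = urlparse(system_id)
    # Anything with a scheme or host (http://, file://, //server/share)
    # is refused outright.
    if parsed.scheme or parsed.netloc:
        raise EntityNotAllowed("URL entities are not allowed: %s" % system_id)
    root = os.path.realpath(allowed_dir)
    # realpath collapses ".." and follows symlinks before the comparison,
    # and joining an absolute path such as /etc/passwd discards the root.
    candidate = os.path.realpath(os.path.join(root, system_id))
    if os.path.commonpath([root, candidate]) != root:
        raise EntityNotAllowed("Entity escapes %s: %s" % (root, system_id))
    return candidate
```

A custom entity resolver could call a check like this before handing anything back to the parser, so that chapter1.xml is accepted while /etc/passwd, ../secret.txt and http://127.0.0.1:1234/ are all refused.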

If I attempt to load an entity over the network with modern versions of the Python lxml library I used in the earlier example, the default is to raise an exception. Sadly, this still leaves us open to the problem of including sensitive local files, but it's a step forward.

We can eliminate the problem entirely by implementing a custom entity resolver that rejects all attempts to use external entities. The safe version of the code creates a custom subclass of etree.Resolver then tells the XML parser to use it.

#!/usr/bin/env python3

import sys
from lxml import etree

class EntityResolverException(Exception):
    pass

class SafeResolver(etree.Resolver):
    '''
    An entity Resolver that rejects all attempts to use entities.
    '''
    def resolve(self, system_url, public_id, context):
        raise EntityResolverException(
            'Attempt to access entity %s' % system_url)

def load_doc(filename):
    # Binary mode lets the parser work out the encoding itself.
    with open(filename, 'rb') as f:
        parser = etree.XMLParser()
        parser.resolvers.add(SafeResolver())
        return etree.parse(f, parser)

def dump_doc(doc):
    # Serialise to str rather than bytes so it prints cleanly.
    return etree.tostring(doc, encoding='unicode')

if __name__ == '__main__':
    doc = load_doc(sys.argv[1])
    print(dump_doc(doc))
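
The same reject-everything idea can be sketched with the standard library's xml.sax module instead of lxml. This is an illustration under stated assumptions rather than a drop-in replacement: RejectingResolver, ExternalEntityRejected and parse_untrusted are hypothetical names, and the external general entities feature is switched on deliberately so that the resolver is consulted and the rejection is loud. Recent Python versions leave that feature off by default, in which case the entity is skipped without the resolver being called.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class ExternalEntityRejected(Exception):
    """Raised when a document tries to pull in an external entity."""


class RejectingResolver(EntityResolver):
    """An entity resolver that refuses every external entity."""

    def resolveEntity(self, publicId, systemId):
        raise ExternalEntityRejected("Attempt to access entity %s" % systemId)


def parse_untrusted(data):
    """Parse XML bytes, raising if any external entity is referenced."""
    parser = xml.sax.make_parser()
    # Assumption: enabled only so the resolver is reached and can object.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(RejectingResolver())
    parser.setContentHandler(ContentHandler())  # events are simply discarded
    parser.parse(io.BytesIO(data))
```

With this in place, a document that references /etc/passwd through an entity should fail with ExternalEntityRejected, while a document without external entities parses as usual.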
  

The problem of external entity attacks is not specific to any particular programming language or XML parser; it's a feature of any fully compliant XML implementation. Whenever you're parsing untrusted documents, you should check the behaviour of your XML library. Unfortunately, despite the dangers, the defaults of many parsers leave you exposed to attack.
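
One way to check the behaviour of a parser is to feed it a harmless canary document and see whether the canary's content comes back. The sketch below does this for the standard library's xml.sax parser with the external general entities feature both off and on. The helper names (expands_external_entities, sax_text, TextCollector) and the canary string are made up for this example, and the same probe could be pointed at any other parser by passing in a different function that takes a file path and returns the parsed text.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

CANARY = "XXE-CANARY-7f3a"  # arbitrary marker text, not a real secret

PROBE_DOC = """<?xml version="1.0"?>
<!DOCTYPE test [ <!ENTITY test SYSTEM "canary.txt">]>
<root>&test;</root>
"""


class TextCollector(ContentHandler):
    """Collects all character data seen while parsing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def sax_text(path, external_ges):
    """Parse the file at path with xml.sax and return its text content."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, external_ges)
    handler = TextCollector()
    parser.setContentHandler(handler)
    parser.parse(path)
    return "".join(handler.chunks)


def expands_external_entities(parse_to_text):
    """Report whether parse_to_text(path) pulls in the canary file."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "canary.txt"), "w") as f:
            f.write(CANARY)
        doc_path = os.path.join(tmp, "probe.xml")
        with open(doc_path, "w") as f:
            f.write(PROBE_DOC)
        try:
            return CANARY in parse_to_text(doc_path)
        except Exception:
            # A parser that refuses the document did not leak the file.
            return False


if __name__ == "__main__":
    for flag in (False, True):
        result = expands_external_entities(lambda p: sax_text(p, flag))
        print("external_ges=%s expands entities: %s" % (flag, result))
```

A check like this only covers local files; it says nothing about whether the parser would also fetch URLs, so treat a clean result as a starting point rather than a guarantee.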
