Redaction of Potentially Sensitive Data from
Mail Abuse Reports
Return Path
100 Mathilda Place, Suite 100
Sunnyvale, CA 94086, US
ietf@cybernothing.org
http://www.returnpath.net/

Cloudmark
128 King St., 2nd Floor
San Francisco, CA 94107, US
msk@cloudmark.com

Applications
MARF Working Group
Keywords: ARF, MARF, feedback loop, spam reporting

Email messages often contain information that
might be considered private or sensitive, per
either regulation or social norms. When such a
message becomes the subject of a report intended
to be shared with other entities, the report
generator may wish to redact or elide the
sensitive portions of the message. This memo
suggests one method for doing so effectively.

The ARF specification, "An Extensible Format for
Email Feedback Reports", defines a message format for
sending reports of abuse in the messaging
infrastructure, with an eye toward automating
both the generation and consumption of those
reports.

For privacy reasons, it might be the policy
of a report generator to anonymize, or obscure,
portions of the report that might identify an end
user who caused the report to be generated.
This has come to be known in feedback loop
parlance as "redaction".
Precisely how this is done is unspecified in
the ARF specification, as it will generally be a
matter of local policy. That specification
does admonish generators against being
overzealous with this practice, as obscuring too
much data makes the report non-actionable.

Previous redaction practices, such as replacing
local-parts of addresses with a uniform string
like "xxxxxxxx", frustrated any kind of
prioritizing or grouping of reports. This memo
presents a practice for conducting redaction in
a manner that allows a report receiver to
detect that two reports were caused by the same
end user without revealing the identity of
that user. That is, the report receiver can use
the redacted string, such as an obscured email
address, to determine that the two underlying
unredacted strings were identical, i.e., that the
reports originally contained the same address.

Generally, it is assumed that the
recipient-identifying fields of a message, when
copied into a report, are to be obscured to protect
the identity of the end user who submitted the
complaint about the message. However, it is also
presumed that other data will be left intact, and
those data could be correlated against log files
or other resources to determine the intended
recipient of the original message.

The key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
"RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in
"Key words for use in RFCs to Indicate Requirement
Levels".

When redaction of reports is desired, in order
to enable a report receiver to correlate reports
that might refer to a common but anonymous source,
the report generator SHOULD use the following
practice:
1.  Select a transformation mechanism (see the
    discussion of transformation mechanisms
    below) that is consistent (i.e., the same
    input string produces the same output each
    time) and reasonably collision-resistant
    (i.e., two different inputs are unlikely to
    produce the same output).

2.  Identify string(s) (such as local-parts
    of email addresses) in a message that need
    to be redacted. Call these strings the
    "private data".

3.  For each piece of private data, apply
    the selected transformation mechanism.

4.  If the output of the transformation
    can contain bytes that are not printable
    ASCII, or if the output can include
    characters not appropriate to replace the
    private data directly, encode the output
    with the base64 algorithm as defined in
    Section 4 of "The Base16, Base32, and Base64
    Data Encodings", or some similar translation,
    to yield a valid replacement in the original
    context. For example, replacing a local-part
    in an email address with transformation output
    containing an "@" character (ASCII 0x40)
    or a space character (ASCII 0x20) is not
    permitted by the specification for
    local-part ("Simple Mail Transfer Protocol"),
    so the transformation output needs to be
    encoded as described.

5.  Replace each instance of private data with
    the corresponding (possibly encoded)
    transformation when generating the
    report. Note that the replaced text could
    also be in a context that has constraints
    such as length limits that need to be
    observed.

This has the effect of obscuring the data (in a
potentially irreversible way) while still allowing
the report recipient to observe that numerous
reports are about one particular end user. Such
detection enables the receiver to prioritize its
reactions based on problems that appear to be
focused on specific end users who may be
under attack.

This memo does not specify a particular
transformation mechanism as a requirement.
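
As an illustration only, the following Python sketch shows one transformation that is consistent and reasonably collision-resistant, applied to the local-part of an address. It assumes HMAC with SHA-256 from the standard library; the key value and the names REDACTION_KEY, redact, and redact_address are hypothetical and are not defined by this memo.

```python
import base64
import hashlib
import hmac
from collections import Counter

# Hypothetical secret known only to the report generator (an assumption).
REDACTION_KEY = b"example-local-secret"


def redact(private_data: str, key: bytes = REDACTION_KEY) -> str:
    """Map private data to a printable token; same input, same output."""
    digest = hmac.new(key, private_data.encode("utf-8"), hashlib.sha256).digest()
    # The base64 alphabet (letters, digits, "+", "/", "=") contains
    # neither "@" nor a space, so the token can stand in for a local-part.
    return base64.b64encode(digest).decode("ascii")


def redact_address(address: str, key: bytes = REDACTION_KEY) -> str:
    """Replace only the local-part of an address, keeping the domain."""
    local_part, _, domain = address.rpartition("@")
    return f"{redact(local_part, key)}@{domain}"


# A receiver can group reports by the redacted value without learning
# the original address.
reported = ["bob@example.net", "carol@example.net", "bob@example.net"]
redacted = [redact_address(a) for a in reported]
counts = Counter(redacted)
```

Here the two reports about the same address yield the same redacted string, so `counts` records that value twice. The token is 44 characters long, which fits within the 64-octet local-part limit of SMTP.
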
The interoperability that this memo seeks to
provide is enabled by the consistency of the
transformation.

The security of the transformation, that is, how
well it frustrates attempts to reverse it,
is a matter of local policy. A continuum of
possible transformations exists, from trivial
ones such as rot13, CRC32 and base64, through
strong cryptographic encodings such as HMAC
("HMAC: Keyed-Hashing for Message Authentication")
and even full encryption, or
private transformations such as mapping an email
address to an internal customer number. An
operator wishing to perform report redaction needs
to select a consistent transformation that
obscures the private data and is resilient to
attempts to extract the original data to the
extent required by local policy, keeping in mind
that the environment in which the transformation is
operating is not a highly secure one. See
the security discussion below for further
details of this issue.

An implementation MAY choose any transformation
that has a reasonably low likelihood of
collision.

General security issues with respect to these
reports are found in the ARF specification.

Message digest collisions are a well-understood
issue. Their relevance here is that a
report receiver could improperly conclude that two
pieces of redacted information were originally the
same when in fact they were not. This can lead
to a denial of service, where the inadvertently
improper application of complaint data causes
unjustified corrective action. Such cases are
sufficiently unlikely as to be of little
concern.

Although the identity of the user causing a report
to be generated can be obscured using this
mechanism, other properties of a message (such as
the Message-ID field) that are not redacted could
be used to recover the original data by locating
them in the message logs of the originating system
or via other data correlation techniques. It is
incumbent on the report generator to anticipate
and redact or otherwise obscure such data, or
accept that such recovery is possible even from
the very simplest kinds of feedback.

It is for this reason that the normative portions
of this memo do not include stronger assertions
about cryptography used in the transformation.
Given the ultimate recoverability of the redacted
information, the cryptographic strength of the
transformation is not a critical security
measure.

The process of redacting a feedback report
satisfies a privacy requirement established by
local policy, and is not meant to provide strong
security properties.

"Complaint Feedback Loop Operational
Recommendations" and Section 8 of
the ARF specification discuss topics related to
the establishment of bilateral agreements between
report producers and consumers. The issues
raised here should also be considered when
establishing such agreements.

While the method of redaction described in this
document may reduce the likelihood of some types
of private data leaking between Administrative
Management Domains (ADMDs), it is
extremely unlikely that report generation software
could ever be created to recognize all of the
different ways that private information could be
expressed through human written language. If
further protections are required, implementers may
wish to consider establishing some sort of
out-of-band arrangements between the relevant
entities to contain private data as much as
possible.

This memo includes no request to IANA.

[RFC Editor note: This section may be removed prior
to publication.]

"An Extensible Format for Email Feedback Reports"

"The Base16, Base32, and Base64 Data Encodings", SJD

"Key words for use in RFCs to Indicate Requirement
Levels", Harvard University

"Complaint Feedback Loop Operational Recommendations",
Messaging Anti-Abuse Working Group

"HMAC: Keyed-Hashing for Message Authentication",
IBM, UCSD, IBM

"Simple Mail Transfer Protocol"

Assume the following input message:
On receipt, bob@example.net reports this message as abusive
through whatever mechanism his mailbox provider has
established. This causes an ARF message to
be generated. However, example.net wishes to obscure Bob's
email address lest it be relayed to the offending agent, which
could lead to more trouble for Bob.

Thus, example.net plans to redact the local-part of the recipient
address in the To: field. Local policy and security requirements
suggest that the algorithm known as "H" (a hash of a key concatenated
with the data to be obscured) using SHA1 is adequate. example.net has
thus selected a redaction key of "potatoes", and the private data
in this case is the string "bob". The concatenation
"potatoesbob" is digested with SHA1 and then base64-encoded to the
string "rZ8cqXWGiKHzhz1MsFRGTysHia4=".

Therefore, when constructing the ARF message in response to Bob's
complaint, the following form of the received message is used in
the third part of the ARF report:
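
The redaction step in this example (key "potatoes", private data "bob", SHA1, then base64) can be sketched in Python as follows. This is a minimal illustration using only the standard hashlib and base64 modules; the function name h_redact is hypothetical, and the sketch shows only the rewritten To: field, not the full report.

```python
import base64
import hashlib


def h_redact(key: str, private_data: str) -> str:
    """The "H" construction: SHA-1 over key || data, then base64."""
    digest = hashlib.sha1((key + private_data).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Redact only the local-part of Bob's address; the domain is kept.
token = h_redact("potatoes", "bob")
redacted_to = f"To: {token}@example.net"
print(redacted_to)
```

A SHA-1 digest is 20 bytes, so the base64 form is always 28 characters ending in a single "=" pad, which matches the shape of the string quoted above. The printed value can be compared against that string.
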
Note, however, that it is possible the redacted information can
be recovered by agents at example.com searching their logs for
the original envelope associated with the message, by correlating
with the Message-ID contents, which were not redacted here. It
is expected that feedback loops generating such reports involve
senders that have been vetted against such information
leakage.

Much of the text in this document was initially
moved from other MARF working group documents,
with contributions from Monica Chew, Tim Draegen,
Michael Adkins, and other members of the Messaging
Anti-Abuse Working Group. Additional feedback was
provided by John Levine, S. Moonesamy, Alessandro
Vesely, and Mykyta Yevstifeyev.