Another good example of this is having separate classes for something like unsafe strings vs. safe strings in a web app. The functions which interact with the outside world accept unsafe strings and emit safe strings to the rest of the application. Then the rest of the application only works with safe strings.
Anything that accepts a safe string can make an assumption that it doesn't need to do any validation (or "parsing" in the context of the OP), which lets you centralize validation logic. And since you can't turn an unsafe string into a safe string without sending it through the validator, it prevents unsafe strings from leaking into the rest of the app by accident.
This concept can be used for pretty much anything where you are doing data validation or transformation.
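A rough sketch of the two-class idea in Python, to make it concrete. The class names, the `validate()` function and the alphanumeric rule are all made up for illustration; a real app would pick its own rule, and Python only enforces any of this at runtime or through a type checker.

```python
import re


class UnsafeString:
    """Raw text from the outside world; nothing is assumed about it."""

    def __init__(self, raw: str) -> None:
        self.raw = raw


class SafeString:
    """Text that has passed validate(); only validate() should build one."""

    _token = object()  # private marker so accidental construction fails

    def __init__(self, value: str, token: object = None) -> None:
        if token is not SafeString._token:
            raise TypeError("use validate() to obtain a SafeString")
        self.value = value


# Hypothetical rule for this sketch: short alphanumeric/underscore names only.
_ALLOWED = re.compile(r"[A-Za-z0-9_]{1,32}")


def validate(s: UnsafeString) -> SafeString:
    """The single place where validation logic lives."""
    if not _ALLOWED.fullmatch(s.raw):
        raise ValueError("input failed validation")
    return SafeString(s.raw, SafeString._token)


def greet(name: SafeString) -> str:
    """Interior code: no re-validation, but it refuses the wrong type."""
    if not isinstance(name, SafeString):
        raise TypeError("greet() requires a SafeString")
    return "Hello, " + name.value
```

With this, `greet(validate(UnsafeString("alice_01")))` works, while `greet(UnsafeString("alice_01"))` fails immediately instead of leaking unvalidated input further in.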
Also a good way to prevent hashed passwords from being accidentally logged.
    import django.db.models

    class PasswordType(django.db.models.Field):
        hashed_pw = django.db.models.CharField()

        def __str__(self):
            # you can even raise an Exception here
            return '<confidential data>'
Not that you should be trying to log this stuff anyway, but unless you're a solo dev you can't prevent other people from creating bugs. You can, however, mitigate common scenarios.
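The same trick works without Django. Here is a minimal plain-Python sketch; the `HashedPassword` name and the `reveal()` accessor are hypothetical, not from any library.

```python
class HashedPassword:
    """Wraps a password hash so it cannot be printed or logged by accident."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def __str__(self) -> str:
        return "<confidential data>"

    # repr() is what containers and many debuggers use, so cover it too.
    __repr__ = __str__

    def reveal(self) -> str:
        """Explicit, greppable accessor for the few places that need the hash."""
        return self._value


pw = HashedPassword("not-a-real-hash")
print(f"login attempt with {pw}")  # prints: login attempt with <confidential data>
```

Anything that really needs the hash has to call `reveal()`, which is easy to search for in code review.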
What are safe and unsafe strings supposed to mean? All strings seem like normal strings to me; "DELETE * FROM db" is no different from any other string until it's given to a SQL query.
Escaping modes. All strings are not equivalent: "Bobby tables" is very different from "'; drop table users; --".
The idea is to encode the contexts where a string is safe to use directly into the type of the variable, and ensure that functions that manipulate them or send them to outside systems can only receive variables of the proper type. When you receive data from the outside world, it's always unsafe: you don't even know if you've gotten a valid utf8 sequence. So all external functions return an UnsafeString, which you can .decode() into a SafeString (or even call it a String for brevity, since most manipulations will be on a safe string).
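As a sketch of that first step, assuming the external data arrives as bytes: the `UnsafeString` wrapper and `read_from_network()` below are stand-ins invented for this example, and the "safe" result is just a plain `str`.

```python
class UnsafeString:
    """Bytes received from the outside world; may not even be valid UTF-8."""

    def __init__(self, raw: bytes) -> None:
        self._raw = raw

    def decode(self) -> str:
        """Return text only if the bytes are well-formed UTF-8.

        Raises UnicodeDecodeError otherwise, so malformed input stops here.
        """
        return self._raw.decode("utf-8", errors="strict")


def read_from_network(payload: bytes) -> UnsafeString:
    """Stand-in for an external function; a real one would read a socket."""
    return UnsafeString(payload)
```

Since `UnsafeString` has no string methods of its own, the only way to do anything text-like with it is to call `.decode()` first.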
Then when you send to a different system, all strings need to be explicitly encoded: you'd pass a SqlString to the DB that's been escaped to prevent SQL injection, you'd build any raw JSON fragments from a JSONString that's had quotes etc. escaped, you'd pass an HtmlString to a web template that properly escapes HTML entities, and so on. It's legal to do a "SELECT $fields FROM tablename where $whereClause" if $fields and $whereClause are SqlStrings, but illegal if they are any other type of string. And if you do <a href="$url"> where $url is an UnsafeString, the templating engine will barf at you.
There are various ways to cut down the syntactic overhead of this system by using sensible defaults for functions. One common one is to receive all I/O as byte[], assume all strings are safe UTF-8 encoded text, and then perform escaping at the library boundaries, using functionality like prepared statements in SQL or autoescaping in HTML templating languages. Most libraries provide an escape-hatch for special cases like directly building an SQL query out of text, using the typed-string mechanism above.
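For the prepared-statement route, a small example using Python's standard `sqlite3` module. The table and column names are made up; the `?` placeholder style is the one `sqlite3` uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# Values travel separately from the SQL text, so no manual escaping is needed.
for name in ["Bobby tables", "'; drop table users; --"]:
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

rows = [r[0] for r in conn.execute("SELECT name FROM users ORDER BY rowid")]
```

Both values end up stored verbatim as data, and the `users` table is still there afterwards.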
That’s the one, thanks for the link! I tried to find it while writing my post but couldn’t for the life of me remember a single thing to even try searching for. I probably last read that article when it was written in 2005.
A safe string is something you got from the programmer (or other trusted source), and an unsafe string is something you got from the network/environment/etc.
Are you genuinely curious, or are you being a troll?
Look at the content of your string, make a decision as to whether you would give it to a SQL engine. If you have not looked, it's presumed unsafe. If you have validated it - parsed it, in the context of this article and this discussion - and decided that you consider it safe, then it is a safe string from that point on.
This isn't a philosophical debate about what "safe" means to humans, it's a programming discussion that says if you only want to pass "select * from reports" to your database, check that's what the string contains before you pass it anywhere.
It's impossible to do SQL-safety validation at any other layer, because otherwise you're making the assertion that someone with the last name "O'Neil" or "Null" (Yes! A real name!) may as well give up and legally change it for the "safety" of programmers that are too lazy to do things right.
I am really not trying to be a troll. Genuinely don't understand this concept of safe strings.
How could software even look at text content and determine safeness? There are cases where string input might be limited to just letters or numbers, but often it's not. As soon as punctuation or Unicode (non-English users) is on the table, text is basically anything and there is no general defense against that.
Parsing and static types could have restrictions on string length, min or max value for numbers, how many items in an array, but it cannot make text safe generally-speaking by any meaning of safe. It has no awareness of how the content will be used.
We're not talking about some absolute, metaphysical "safe strings" that guard against every possible flaw, but rather about better supporting an already existing safety check.
If you never thought to write an escaping function in the first place, you can't write a SqlString safe type either, obviously. Equally obviously, if you can write an escaping function but you can't write a function that detects a DROP TABLE, then you can write a SqlString type but not a SelectQueryString type.
The idea being discussed here is simply that if you do write an escaping function, its signature should not be (String -> String) or (String -> Boolean) or, God forbid, (String -> void), but something like (String -> SqlString).
This ensures that whatever you feed to your database must have gone through such an escaping function, instead of expecting the programmer to simply remember it. Also prevents you from accidentally escaping a string twice.
(Obligatory pedantic disclaimer: if you're working with modern databases, please don't escape your own strings and just use parameters instead.)
I agree with you. The concept of a safe string in isolation is too abstract to be meaningful. A correct API, such as interacting with a database only using explicit parameters (instead of string-concatenating to build up a query) is always safe, irrespective of the provenance of the input. The input could be a virus or a DB command and this would still be 100% safe.
What people mean by safe string in more specific contexts however, is meaningful, but the word "safe" is an unfortunate choice. Instead, think "SqlEscapedString" or "HtmlEscapedString" or "UriEscapedString". These are much more meaningful, and their use-case should be obvious. You can convert an arbitrary input "String" type into a "SqlEscapedString" and then safely use simple string concatenation to build up a query. This is useful in situations where non-parameter parts of the query are dependent upon the input in ways that are not safely exposed in the DB query API. For example, building up complicated WHERE clauses or using dynamic table names.
So you can write something like the following (in pseudo code):
    String tableName = ParseFromUntrustedPacket( packet );
    SqlEscapedString sqlTableName = new SqlEscapedString( tableName );
    SqlEscapedString query = SqlEscapedString.Unsafe( "SELECT * FROM " ) &
        sqlTableName &
        SqlEscapedString.Unsafe( " WHERE Foo is NOT NULL" );
    var result = connection.Execute( query );
The benefit of this kind of approach is that if that last function call has the signature of "Execute( SqlEscapedString q )", then it is basically impossible to accidentally pass an unescaped (unsafe) input string into it. At every step, the developer is forced to make a decision to either pass in a potentially dangerous query snippet using "Unsafe(...)" or to make input strings safe by escaping them.
Similarly, this method converts Strings into a different type when escaping them, making it (almost) impossible to accidentally double-escape inputs, which is an issue commonly seen in some environments such as complex shell scripts.
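A runnable Python approximation of the pseudocode, for anyone who wants to play with it. Everything here is an assumption made for the sketch: escaping is done by quoting the input as an SQL identifier (doubling any `"`), and `execute()` is a stand-in that only checks the type rather than talking to a database.

```python
class SqlEscapedString:
    """SQL text that is either escaped input or a programmer-vetted literal."""

    def __init__(self, value):
        if isinstance(value, SqlEscapedString):
            # Already escaped: copy as-is instead of escaping twice.
            self._sql = value._sql
        elif isinstance(value, str):
            # Assumption: quote as an SQL identifier by doubling '"'.
            self._sql = '"' + value.replace('"', '""') + '"'
        else:
            raise TypeError("expected str or SqlEscapedString")

    @classmethod
    def unsafe(cls, literal: str) -> "SqlEscapedString":
        """Trust a literal verbatim; the caller takes responsibility."""
        obj = cls.__new__(cls)
        obj._sql = literal
        return obj

    def __and__(self, other: "SqlEscapedString") -> "SqlEscapedString":
        # Concatenation is only defined between two SqlEscapedStrings.
        if not isinstance(other, SqlEscapedString):
            return NotImplemented
        return SqlEscapedString.unsafe(self._sql + other._sql)

    def __str__(self) -> str:
        return self._sql


def execute(query: SqlEscapedString) -> str:
    """Stand-in for connection.Execute; it only checks the type here."""
    if not isinstance(query, SqlEscapedString):
        raise TypeError("execute() requires a SqlEscapedString")
    return str(query)


table = SqlEscapedString("reports")
query = (
    SqlEscapedString.unsafe("SELECT * FROM ")
    & table
    & SqlEscapedString.unsafe(" WHERE Foo IS NOT NULL")
)
```

Mixing in a plain `str` with `&` raises a `TypeError`, and wrapping `table` in `SqlEscapedString(...)` a second time leaves it unchanged, which is the double-escaping protection described above.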
ASP.NET for example does something similar with IHtmlString.
Oh, then you're reading too much into "safe" and assuming it means "can never do any bad if used in any situation, must need an AI".
It's the same way software can look at a number that's going to control a water heater and determine whether it's a safe temperature for a human body or not. You the programmer chose some limits. When the user enters a number, it's an unsafe value by default, because you haven't validated it.
After you validate it, you have something which is 'safe' to pass around to anywhere in your code, like a security checkpoint says that random people are unsafe, and when they enter a building their details are checked, and then they are OK to enter and go anywhere inside the building.
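The water heater version might look something like this. The limits and all the names are invented for the example; the point is only that the check happens once, at the checkpoint.

```python
from dataclasses import dataclass

# Hypothetical limits chosen by the programmer for this sketch.
MIN_CELSIUS = 5.0
MAX_CELSIUS = 45.0


@dataclass(frozen=True)
class SafeTemperature:
    """A temperature that has passed check_temperature()."""

    celsius: float


def check_temperature(user_input: str) -> SafeTemperature:
    """The checkpoint: parse the raw input and enforce the chosen limits."""
    value = float(user_input)  # raises ValueError for non-numeric input
    if not (MIN_CELSIUS <= value <= MAX_CELSIUS):
        raise ValueError(f"{value} is outside {MIN_CELSIUS}..{MAX_CELSIUS}")
    return SafeTemperature(value)


def set_heater(target: SafeTemperature) -> str:
    """Interior code: trusts the type instead of re-checking the range."""
    if not isinstance(target, SafeTemperature):
        raise TypeError("set_heater() requires a SafeTemperature")
    return f"heater set to {target.celsius:.1f} C"
```

Python won't stop someone from constructing `SafeTemperature(1000.0)` directly, so going through `check_temperature()` is a convention here; a language with private constructors can enforce it.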
You, the programmer, choose what things you consider safe and unsafe and those words mean validated or unvalidated, verified or unverified, checked or unchecked, approved or unapproved, known or unknown, outside or inside, or any other pair.
> it cannot make text safe generally-speaking by any meaning of safe
If something can't be done, ever, in any situation, that probably isn't what people are talking about doing.
The point that's being made here is if you make safe and unsafe strings separate types, in a strongly-typed system, it is impossible to use an unsafe string where a safe string is expected or vice versa. When you have a boundary function that turns an unsafe string into a safe string (e.g., escaping), or that rejects strings that are not safe, you can have a system where all the inputs are unsafe and are forced to go through such a mechanism exactly once to guarantee freedom from double-escaping issues.
I think the above definition of "gets turned into safe strings early" isn't necessarily a clear one.
The general idea is to separate strings into different types, with different rules. E.g. an HTML templating engine will always escape strings unless they're of a specific type (e.g. in Python, the popular MarkupSafe library calls the type "Markup") that says it's ok to include as raw HTML (e.g. because it's the output of a sanitizer), an SQL query builder will only accept specially tagged strings as non-parameters into queries, ..., which reduces the likelihood of the programmer accidentally using a string in a place where it isn't correct to use. Username field doesn't have any special rules attached? All code will reject unsafe use as far as possible.
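A toy version of the templating rule, using only the standard library. `RawHtml` and `render()` are hypothetical names for this sketch, not the API of any real engine; real engines also handle attribute and URL contexts, which this does not.

```python
import html


class RawHtml(str):
    """Hypothetical marker type: text vetted for inclusion as raw HTML."""


def render(template: str, **values: str) -> str:
    """Fill {name} placeholders, escaping everything not marked RawHtml."""
    prepared = {
        key: value if isinstance(value, RawHtml) else html.escape(str(value))
        for key, value in values.items()
    }
    return template.format(**prepared)
```

So `render("<p>{user}</p>", user="<script>alert(1)</script>")` comes out escaped, while `render("<p>{body}</p>", body=RawHtml("<b>hi</b>"))` passes the markup through untouched.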
Depends on context: for content, only < and & need to be escaped, within a tag (but not an attribute) > needs to be escaped, within an attribute quotes of the same kind that started the attribute value (if any) must be escaped. Then there are legitimate cases of richly formatted user input/markup where you want to restrict script or block-level elements, or elements that can reach out to a container element such as a paragraph or section. I could go on here, but the point is to use HTML-aware template engines and markup processors, not rely on magic escaping routines.
A string that is supposed to represent a "name" in a web app context: safe or not? I am referring to potential SQL injections.
Surely no names contain semicolons, but is it the job of the business-logic part of your app to decide that names containing only A-Za-z (or whatever) are safe?
It is contextually dependent, meaning in practice the check belongs as close to the actual SQL query as possible. Or to the call to the file system, where dots are unsafe, and so on.
Static typing helps here as described in the article.