Allowing date uncertainty
In my last post on the awkwardness of dates in JavaScript and more generally, I described a problem I wanted to address: I would like to see a notation that allows me to capture a date to just the precision I need and at the same time note the imprecision left over. In this post, I propose a way to handle this.
Recapping example
Last post, I talked about “Dickie dates”. This was in reference to the author of Mafia Republic, the book I used as an example of a source of a variety of date references. From that book, I learned that there was an ‘ndrangheta shooting at a restaurant in Duisburg, Germany on 15th August 2007. I could probably even find out the time of the shooting if I looked at other sources.
But there are other types of date-references in the book that I nonetheless have wanted to capture alongside the more precisely dated events like that shooting.
For example, Dickie refers to a number of isolated kidnapping events that the ‘ndrangheta carried out. He gives the dates and details of a few examples, probably the most famous or grisly ones. But he sums up all the others by saying there was a “pattern of kidnappings” in the 1960s in general [1].
Later in the same book I learned that there was a “seemingly interminable sequence of feuds over local turf” in the 1980s [2]. Again, I have no specifics but I have the shape or rough outline of the events. If I want to capture this in data and later visually, I could research each feud and plot it on a timeline and that would give me a more exact shape.
But in the absence of that information, I still want to acknowledge even the vague shape of Dickie’s claim. Even if I do find more exact data later, I will then be able to check the veracity of Dickie’s claim against what I have.
Towards ambiguity
I mentioned last post that I had already started to make a rough notation whereby I would simply give the characters of the date which I could provide. I followed a sort-of ISO-format date to do this, a delimiter-free YYYYMMDDhhmmss pattern.
So whereas the event in Duisburg on 15th August 2007 would be 20070815, that “seemingly interminable sequence of feuds” in the 1980s that Dickie mentioned would simply be represented by 198.
What I’ve come to realise is that the length of the string given – its relative completeness – could be an index of precision or imprecision. So in the example of the Duisburg shooting, I could say there’s a precision index of 8, based literally on the length of the string. But in the example of the sequence of feuds, I could say there’s only a precision index of 3, much lower.
The precision index as described is, ironically, not that precise. But it is an indicator that I could use and may give me a model to refine in future.
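As a rough sketch of that idea, the index can be read straight off the string. The helper name precisionIndex and the 14-digit upper limit (the full YYYYMMDDhhmmss length) are my own assumptions for illustration.

```javascript
// Hypothetical helper: the precision index is simply the number of digits supplied.
function precisionIndex(partialDate) {
  // Only digits count; anything else is rejected rather than guessed at.
  if (!/^\d{1,14}$/.test(partialDate)) {
    throw new RangeError("Expected between 1 and 14 digits, got: " + partialDate)
  }
  return partialDate.length
}

console.log(precisionIndex("20070815")) // 8
console.log(precisionIndex("198"))      // 3
```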
Requirements
The requirements here are relatively simple, although they may be difficult to implement:
- The function should check the date according to a strict format (like the ISO convention for combined date-time formatting).
- The function should, however, allow the date to be incomplete.
- The function should assume that the date indicates a range and provide a start date and an end date based on that assumption. (In the case of a single event at a very specific moment, the start and end date will be the same.)
- The function should autocomplete any missing components of an incomplete date, falling back to the earliest possible value for the start date and moving forward to the latest possible value for the end date.
- The function should also return a precision index for the given date.
In order to achieve this, I expect the function to accept a string or an object containing a string or two strings and to return an object with a start date, an end date and a precision index.
Implementation
After reading Burk Hufnagel’s piece in 97 Things Every Programmer Should Know I was curious to know if I could achieve some of the above in a single regular expression. It would be terse and difficult to write and read but I thought it might help with function composition later on because the pattern itself would be reusable.
Hufnagel’s own regular expression was written simply to check 12-hour times:
(0[1-9]|1[0-2]):[0-5][0-9]:[0-5][0-9] ([AP]M)
My own would have to handle not just 24-hour time but also a preceding date component. My first attempt was simply to try and write something that would check the format was correct:
\d{4}(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])(?:[0-1][0-9]|2[0-3])[0-5][0-9][0-5][0-9]
A visual representation of this is probably easier to read.
You’ll notice I’ve kept away from punctuation or delimiters which are common to the ISO-format. Instead I’ve assumed that every year will have four digits, every month, day, hour, minute and second represented will each have two.
The parentheses simply group together the possible variations within each date component. The ?: flag at the start of each group makes it non-capturing.
This code will check that the date follows the format strictly but will not allow for incompleteness.
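To illustrate, here is a small sketch that runs the strict pattern against a few inputs. The ^ and $ anchors are my addition, so that the whole string has to match rather than just part of it.

```javascript
// The strict pattern from above, anchored (an assumption) so it must cover the whole string.
const strictPattern = /^\d{4}(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])(?:[0-1][0-9]|2[0-3])[0-5][0-9][0-5][0-9]$/

console.log(strictPattern.test("20070815093000")) // true: a complete date and time
console.log(strictPattern.test("20070815"))       // false: the time is missing
console.log(strictPattern.test("198"))            // false: only a decade
console.log(strictPattern.test("20071315093000")) // false: there is no 13th month
```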
After a little fiddling with this, I came up with a pattern made of a set of nested groups, with all the relevant parts captured.
(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?
Again the visualisation probably makes more sense, as you can see the groupings.
A table will help make sense of the railroad diagram here.
Group | Date component |
1 | Year |
2 | Month |
3 | Day |
4 | Hour |
5 | Minute |
6 | Second |
While the pattern tries to exclude incorrect dates and times – so one can’t specify the 32nd March for example – it still isn’t exact enough to exclude seemingly allowable dates like 30th February, which even in a Leap Year is impossible.
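A quick sketch shows which groups are filled for a few inputs. The helper groupsOf is hypothetical and exists only to print the six groups, with a dash standing in for any group that did not take part in the match.

```javascript
const datePattern = /(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?/

// Drop element 0 (the whole match) so the list lines up with groups 1 to 6.
function groupsOf(input) {
  return datePattern.exec(input).slice(1).map((group) => group || "-").join(" ")
}

console.log(groupsOf("198"))      // 198 - - - - -
console.log(groupsOf("20070815")) // 2007 08 15 - - -
console.log(groupsOf("20070230")) // 2007 02 30 - - -  (30th February slips through)
```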
Parsing dates
Assuming a date passes this validity check it’s then very easy to complete whatever is missing from the date.
We can use each of the capture groups in the Regex pattern to specify a part of the date or to fall back to a default when they are missing or incomplete.
Let’s create a function parseDate that takes one argument, date.
var parseDate = function (date) {
  // function body to be inserted
}
At the top of this function we can specify our regular expression and execute it on the argument provided.
var parseDate = function (date) {
  var datePattern = /(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?/
  var dateResult = datePattern.exec(date)
  // rest of function body to be written
}
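One thing worth guarding against at this point: exec returns null when nothing matches, and because the pattern is not anchored it will also happily match only the front of a string. A minimal sketch of a guard, assuming a hypothetical matchDate helper that throws in both cases, might look like this:

```javascript
const datePattern = /(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?/

// Hypothetical guard: returns the match array or throws, so callers never index into null.
function matchDate(input) {
  const match = datePattern.exec(input)
  // Reject both "no match at all" and "matched only part of the input".
  if (match === null || match[0] !== input) {
    throw new RangeError("Unrecognised date: " + input)
  }
  return match
}

console.log(matchDate("198")[1])            // 198
console.log(datePattern.exec("unknown"))    // null
console.log(datePattern.exec("2007081")[0]) // 200708 (the trailing 1 is silently dropped)
```

Comparing the matched text with the input is one way to catch the partial match. Anchoring the pattern is another, though it changes how the pattern backtracks.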
Following that it’s relatively easy to write some code that uses the array of capture groups held in dateResult and deals with each of the parts of the date in the pattern.
var startYear = dateResult[1]
  ? `${dateResult[1]}${"0000".substr(dateResult[1].length)}`
  : ""
var startMonth = dateResult[2] || "01"
var startDay = dateResult[3] || "01"
var startHour = dateResult[4] || "00"
var startMinute = dateResult[5] || "00"
var startSecond = dateResult[6] || "00"
So startYear, for example, takes the value of the first group and checks whether it is falsy (that is, an empty string, null or undefined), returning an empty string if it is. If the value of dateResult[1] is truthy, it is used and, in case the captured string is shorter than the expected four digits, the rest is filled with zeros.
So 198, which I used earlier to represent the 1980s, is filled out to become 1980. Because there are no month, day, or time values beyond that 198 (which the pattern has already checked), the variables that follow all default to the first possible value in the range for each case. So the month value becomes 01 because 00 is not possible, as does the day value, but the hour, minute and second values can all fall back to 00.
Combining all these parts, we get the start date we want.
var startDate = `${startYear}${startMonth}${startDay}${startHour}${startMinute}${startSecond}`
So 198 becomes 19800101000000.
The code for producing the end date is slightly different and more complicated because the last possible value in the range differs based on the values of other parts of the date.
The year is easy enough and looks very much like what we did for the start date.
var endYear = dateResult[1]
  ? `${dateResult[1]}${"9999".substr(dateResult[1].length)}`
  : ""
So 198 would here give an endYear of 1989, which does indeed mark the end of the 1980s.
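As an aside, the same padding can be written with String.prototype.padEnd, which avoids the legacy substr method. This is only an alternative sketch; the helper name yearRange is hypothetical.

```javascript
// Hypothetical helper: pad a partial year out to four digits in both directions.
function yearRange(partialYear) {
  return {
    startYear: partialYear.padEnd(4, "0"), // earliest year the digits could mean
    endYear: partialYear.padEnd(4, "9")    // latest year the digits could mean
  }
}

console.log(yearRange("198"))  // startYear 1980, endYear 1989
console.log(yearRange("2007")) // startYear 2007, endYear 2007
```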
However, the day given in any date could end in one of four ways. Most of the time, either the number 30 or the number 31 will mark the end of the month. But for most Februarys, the number would be 28 and every four years (leaving aside the Gregorian calendar’s century exceptions) we have 29.
This results in the following code:
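// The outer parentheses matter: || binds more tightly than ?:, so without them
// a supplied day would be used as the condition rather than returned.
var endDay = dateResult[3] || (
  (dateResult[2] === "02")
    ? (dateResult[1] % 4 === 0)
      ? "29"
      : "28"
    : (/(04|06|09|11)/.test(dateResult[2]))
      ? "30"
      : "31"
)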
The logic of what happens here can be explained in a nested bulleted list.
- First, the non-falsiness of the day component is checked.
  - If the day component is truthy, we go with that.
  - If the day component is falsy then the month is checked.
    - If the month is February then the year is checked.
      - If the year is divisible by four, then "29" is returned.
      - If the year is not divisible by four, then "28" is returned.
    - If the month is not specifically February, then the month is checked further.
      - If the month is April, June, September, or November, then "30" is returned.
      - If the month is not any of those, then "31" is returned.
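The divisible-by-four test is a simplification: in the Gregorian calendar, century years are only leap years when divisible by 400, so 1900 was not one but 2000 was. A small sketch of the fuller rule, using hypothetical helpers isLeapYear and lastDayOfMonth that take numbers rather than the captured strings:

```javascript
// Gregorian rule: every fourth year, except centuries, unless divisible by 400.
function isLeapYear(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0
}

// Last day of the given month (1 to 12) in the given year.
function lastDayOfMonth(year, month) {
  if (month === 2) {
    return isLeapYear(year) ? 29 : 28
  }
  return [4, 6, 9, 11].includes(month) ? 30 : 31
}

console.log(lastDayOfMonth(2008, 2)) // 29
console.log(lastDayOfMonth(1900, 2)) // 28
console.log(lastDayOfMonth(2007, 9)) // 30
console.log(lastDayOfMonth(2007, 8)) // 31
```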
We can put these parts together to get a parseDate function that will take a string and put out an object with a start date and an end date.
var parseDate = function (date) {
  var datePattern = /(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?/
  var dateResult = datePattern.exec(date)

  // Start of the range: pad the year with zeros, default the rest to the earliest value
  var startYear = dateResult[1]
    ? `${dateResult[1]}${"0000".substr(dateResult[1].length)}`
    : ""
  var startMonth = dateResult[2] || "01"
  var startDay = dateResult[3] || "01"
  var startHour = dateResult[4] || "00"
  var startMinute = dateResult[5] || "00"
  var startSecond = dateResult[6] || "00"

  // End of the range: pad the year with nines, default the rest to the latest value
  var endYear = dateResult[1]
    ? `${dateResult[1]}${"9999".substr(dateResult[1].length)}`
    : ""
  var endMonth = dateResult[2] || "12"
  // Parenthesised so that a supplied day is returned rather than used as the condition
  var endDay = dateResult[3] || (
    (dateResult[2] === "02")
      ? (dateResult[1] % 4 === 0)
        ? "29"
        : "28"
      : (/(04|06|09|11)/.test(dateResult[2]))
        ? "30"
        : "31"
  )
  var endHour = dateResult[4] || "23"
  var endMinute = dateResult[5] || "59"
  var endSecond = dateResult[6] || "59"

  var startDate = `${startYear}${startMonth}${startDay}${startHour}${startMinute}${startSecond}`
  var endDate = `${endYear}${endMonth}${endDay}${endHour}${endMinute}${endSecond}`

  return {
    startDate: startDate,
    endDate: endDate
  }
}
However, this doesn’t yet fulfil the requirements. For a start there’s no precision index returned. On top of that, the function will only take a single string but it needs to be able to take an object too.
Accepting start and end dates
Some small adjustments are required to allow for not just a string as input but an object containing startDate and endDate.
For a start, some code is needed to work out how to handle the date:
var dateInput = {}
if (typeof date === "string") {
  dateInput.startDate = date
  dateInput.endDate = date
} else if (typeof date === "object") {
  dateInput.startDate = date.startDate
  dateInput.endDate = date.endDate || date.startDate
}
If only one date is provided (as in a string input) it is used for both. That way the range is inferred from the precision or imprecision of the date given.
Next, the references to date are replaced with either dateInput.startDate or dateInput.endDate.
var startDate = datePattern.exec(dateInput.startDate)
var endDate = datePattern.exec(dateInput.endDate)
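And the references to dateResult are thereby replaced with either startDate or endDate. Note that the earlier listing already uses startDate and endDate for the assembled output strings, so one of the two pairs needs a different name.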
Precision index
We know from the work put into the regular expression that the last non-null capture group returned will give an indication of the specificity or precision of the date provided.
So it should be enough to work back from the last group, which indicates the number of seconds in the string given if any, knocking the index down as we go.
var startPrecision = 6
while (startPrecision) {
  if (startDate[startPrecision]) {
    break
  }
  startPrecision -= 1
}
As the variable name startPrecision indicates, there needs to be a precision index for each type of date. So the same must be done for endDate and then the two can be added together.
var endPrecision = 6
while (endPrecision) {
  // Check the groups captured for the end date, not the start date
  if (endDate[endPrecision]) {
    break
  }
  endPrecision -= 1
}
var precision = startPrecision + endPrecision
Remember that if only one date is given then it is duplicated across startDate and endDate, so the precision of one date is effectively doubled, which I think is a fair reflection of what the precision index should do.
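Pulling the pieces together, a compact sketch of the whole thing might look like the following. It is one possible arrangement rather than a finished implementation: the helpers lastDay, precisionOf and matchOrThrow are hypothetical names of my own, the guard against partial matches is an addition, and the leap-year test keeps the simple divisible-by-four rule used above.

```javascript
const datePattern = /(\d{1,4})(?:(0[1-9]|1[0-2])(?:(0[1-9]|[1-2][0-9]|3[0-1])(?:([0-1][0-9]|2[0-3])(?:([0-5][0-9])(?:([0-5][0-9]))?)?)?)?)?/

// Last day of a month, using the simple divisible-by-four leap-year rule.
function lastDay(year, month) {
  if (month === "02") {
    return Number(year) % 4 === 0 ? "29" : "28"
  }
  return /^(04|06|09|11)$/.test(month) ? "30" : "31"
}

// Number of the last capture group that matched (1 = year only, 6 = seconds).
function precisionOf(match) {
  let index = 6
  while (index > 1 && !match[index]) {
    index -= 1
  }
  return index
}

// Assumption: anything the pattern cannot consume in full is an error.
function matchOrThrow(input) {
  const match = datePattern.exec(input)
  if (match === null || match[0] !== input) {
    throw new RangeError("Unrecognised date: " + input)
  }
  return match
}

function parseDate(date) {
  // Accept either a single string or an object with startDate and optional endDate.
  const input = typeof date === "string" ? { startDate: date } : date
  const start = matchOrThrow(input.startDate)
  const end = matchOrThrow(input.endDate || input.startDate)

  const startYear = start[1].padEnd(4, "0")
  const endYear = end[1].padEnd(4, "9")
  const endMonth = end[2] || "12"
  const endDay = end[3] || lastDay(endYear, endMonth)

  return {
    startDate: startYear + (start[2] || "01") + (start[3] || "01") +
      (start[4] || "00") + (start[5] || "00") + (start[6] || "00"),
    endDate: endYear + endMonth + endDay +
      (end[4] || "23") + (end[5] || "59") + (end[6] || "59"),
    precision: precisionOf(start) + precisionOf(end)
  }
}

console.log(parseDate("198"))
// startDate 19800101000000, endDate 19891231235959, precision 2
console.log(parseDate({ startDate: "20070815", endDate: "200709" }))
// startDate 20070815000000, endDate 20070930235959, precision 5
```

Note that this precision index counts matched groups rather than characters, so the 1980s score 1 per date here rather than the 3 that the string length would give.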
Gaps and problems
There is an issue with the regular expression which I mentioned earlier, around apparently valid but actually invalid dates creeping in (like 30th February). Aside from this though, I feel this code gives me a good start in working with the data.
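One way to catch those impossible dates, sketched here as an assumption rather than something the pattern itself does, is to hand the numbers to the built-in Date object and see whether they survive the round trip. Date rolls 30th February over into March, so a mismatch means the date was not real. The helper name isRealDate is hypothetical.

```javascript
// Hypothetical check: true only if year/month/day name a day that exists.
function isRealDate(year, month, day) {
  const candidate = new Date(0)
  // setUTCFullYear takes a zero-based month, hence month - 1.
  candidate.setUTCFullYear(year, month - 1, day)
  return candidate.getUTCFullYear() === year &&
    candidate.getUTCMonth() === month - 1 &&
    candidate.getUTCDate() === day
}

console.log(isRealDate(2007, 2, 30)) // false
console.log(isRealDate(2008, 2, 29)) // true
console.log(isRealDate(2007, 8, 15)) // true
```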
I will probably want to refactor and break it up at some point.
I also need to build tests around this to ensure it continues to work if it needs developing further.
Next
The next step following on from this attempt to develop a convention around imprecisely given dates is to integrate it back into the data visualisation work that started off the investigation.