Regexps

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

—Jamie Zawinski

A language for patterns

a - match a

ab - match a followed by b

a* - match a 0 or more times

a+ - match a 1 or more times

a? - match a 0 or 1 times

a{n} - match a exactly n times

a{n,m} - match a between n and m times, inclusive
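A quick way to get a feel for the quantifiers is to try each one on a few strings. The sketch below assumes a small helper named whole (not part of any library) that just wraps String.matches, which tests the whole string against the regexp.

```java
class QuantifierDemo {
    // Hypothetical helper: true if the entire input matches the regexp.
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole("a*", ""));          // true: zero a's is allowed
        System.out.println(whole("a+", ""));          // false: needs at least one a
        System.out.println(whole("a?", "aa"));        // false: at most one a
        System.out.println(whole("a{3}", "aaa"));     // true: exactly three a's
        System.out.println(whole("a{2,4}", "aaaaa")); // false: five is more than four
        System.out.println(whole("ab", "ab"));        // true: a followed by b
    }
}
```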

Character classes

. - any character (by default, except line terminators)

[abc] - one of the characters 'a', 'b', or 'c'.

\s - any whitespace character

\S - any non-whitespace character

\w - any word character (letter, digit, or underscore)

\d - any digit character
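The character classes can be checked one character at a time. This is a minimal sketch using default Pattern flags; the helper name whole is made up for the example.

```java
class CharClassDemo {
    // Hypothetical helper: true if the entire input matches the regexp.
    static boolean whole(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(whole(".", "x"));     // true
        System.out.println(whole(".", "\n"));    // false: by default . skips line terminators
        System.out.println(whole("[abc]", "b")); // true
        System.out.println(whole("[abc]", "d")); // false
        System.out.println(whole("\\s", "\t"));  // true: tab is whitespace
        System.out.println(whole("\\S", " "));   // false: space is whitespace
        System.out.println(whole("\\w", "_"));   // true: underscore counts as a word character
        System.out.println(whole("\\d", "7"));   // true
    }
}
```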

For all the details

java.util.regex.Pattern

Writing regexps in Java

When we write regexps containing \ in Java strings, we need to escape the \. So we write "\\s" rather than "\s".

Yeah, that’s annoying.
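To see the two layers of escaping at work, here is a small sketch. The Java compiler turns "\\s" into the two-character string \s, and only then does the regexp engine see it. The class and method names are invented for the example.

```java
class EscapeDemo {
    // The Java literal "\\s" is a two-character String: a backslash and an s.
    static int regexLength() {
        return "\\s".length();
    }

    // To match a literal backslash, the regexp needs \\, which is "\\\\" in Java source.
    static boolean hasBackslashBetweenAAndB(String s) {
        return s.matches("a\\\\b");
    }

    // To match a literal dot, the regexp needs \., which is "\\." in Java source.
    static boolean isADotB(String s) {
        return s.matches("a\\.b");
    }

    public static void main(String[] args) {
        System.out.println(regexLength());                      // 2
        System.out.println(hasBackslashBetweenAAndB("a\\b"));   // true
        System.out.println(isADotB("a.b"));                     // true
        System.out.println(isADotB("axb"));                     // false: the dot is escaped
    }
}
```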

String.split

"abc def  xyz".split("\\s+") ⟹ { "abc", "def", "xyz" }

Split the string at each place the regexp matches.
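Two details of split are worth trying out: the separator can be any regexp, and a match at the very start of the string leaves an empty first element. The sketch below uses made-up helper names and calls strip() (Java 11+) to avoid that empty element.

```java
import java.util.Arrays;

class SplitDemo {
    // Hypothetical helper: split on commas, ignoring whitespace around them.
    static String[] fields(String line) {
        return line.strip().split("\\s*,\\s*");
    }

    // Hypothetical helper: split on runs of whitespace after trimming the ends.
    static String[] words(String text) {
        return text.strip().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields(" a, b,c ,  d"))); // [a, b, c, d]
        System.out.println(Arrays.toString(words("  abc def ")));    // [abc, def]
        // Without strip(), the leading whitespace produces an empty first element:
        System.out.println(Arrays.toString("  abc def ".split("\\s+"))); // [, abc, def]
    }
}
```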

String.matches

"510-867-5309".matches("\\d{3}-\\d{3}-\\d{4}") ⟹ true

Test whether the whole string matches the regexp.
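Because matches tests the whole string, a phone number embedded in a longer string does not match. One way around that, sketched here for single-line text, is to wrap the regexp in .* on both sides. The constant and method names are assumptions for the example.

```java
class MatchesDemo {
    static final String PHONE = "\\d{3}-\\d{3}-\\d{4}";

    // True only if the entire string is a phone number.
    static boolean isPhone(String s) {
        return s.matches(PHONE);
    }

    // True if a phone number appears anywhere in a single-line string.
    static boolean containsPhone(String s) {
        return s.matches(".*" + PHONE + ".*");
    }

    public static void main(String[] args) {
        System.out.println(isPhone("510-867-5309"));                // true
        System.out.println(isPhone("call 510-867-5309 now"));       // false: extra text
        System.out.println(containsPhone("call 510-867-5309 now")); // true
    }
}
```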

java.util.regex.*

Pattern p = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
Matcher m = p.matcher("510-867-5309");
m.matches() ⟹ true

Pattern is an object that holds the compiled form of a regexp we want to use in matching.

Matcher is an object that combines a Pattern with an actual String to match against.
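The split between the two classes pays off when one regexp is tested against many strings: compile the Pattern once and make a new Matcher per string. A minimal sketch, with a hypothetical validNumbers helper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PhoneFilter {
    // Compiled once; Pattern objects are immutable and can be shared.
    private static final Pattern PHONE = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");

    // Hypothetical helper: keep only the candidates that are entirely a phone number.
    static List<String> validNumbers(List<String> candidates) {
        List<String> valid = new ArrayList<>();
        for (String s : candidates) {
            // Each String gets its own Matcher; the Pattern is reused.
            if (PHONE.matcher(s).matches()) {
                valid.add(s);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        System.out.println(validNumbers(List.of("510-867-5309", "5108675309", "415-555-0123")));
        // prints [510-867-5309, 415-555-0123]
    }
}
```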

Finding rather than matching

// Look for successive matches of the pattern
while (m.find()) {
  System.out.println(m.group()); // prints what matched
}

The Matcher is a very stateful object. Each call to find searches for the next occurrence of the pattern, and m.group() returns the text of the most recent match.
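Here is a self-contained version of the find loop, as a sketch. It builds a fresh Matcher so the search starts at the beginning of the text, and it also records m.start(), the index where each match begins. The findAll name and the sample text are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: returns "index:match" for every phone number in the text.
    static List<String> findAll(String text) {
        Pattern p = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
        Matcher m = p.matcher(text); // fresh Matcher, so find starts at the beginning
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // start() and group() both describe the most recent successful find
            found.add(m.start() + ":" + m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Call 510-867-5309 or 415-555-0123."));
        // prints [5:510-867-5309, 21:415-555-0123]
    }
}
```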

Groups

Pattern p = Pattern.compile("(\\d{3})-(\\d{3})-(\\d{4})");
Matcher m = p.matcher("510-867-5309");

m.matches() ⟹ true
m.group() ⟹ "510-867-5309"
m.group(1) ⟹ "510"
m.group(2) ⟹ "867"
m.group(3) ⟹ "5309"

Parenthesized sections of the pattern create “capture groups” that we can use to extract parts of what matched.
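Numbered groups get hard to read as patterns grow. java.util.regex also supports named groups, written (?<name>...), which can be fetched with m.group("name"). A sketch, where the group names area, exchange, and line are choices made for this example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroups {
    private static final Pattern PHONE =
        Pattern.compile("(?<area>\\d{3})-(?<exchange>\\d{3})-(?<line>\\d{4})");

    // Hypothetical helper: the area code, or null if the string is not a phone number.
    static String areaCode(String number) {
        Matcher m = PHONE.matcher(number);
        return m.matches() ? m.group("area") : null;
    }

    public static void main(String[] args) {
        Matcher m = PHONE.matcher("510-867-5309");
        if (m.matches()) {
            System.out.println(m.group("area"));     // 510
            System.out.println(m.group("exchange")); // 867
            System.out.println(m.group("line"));     // 5309
            System.out.println(m.group(1));          // 510: named groups are still numbered
        }
    }
}
```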

Using the Matcher with streams

m = p.matcher(someBigBlobOfText);
Stream<MatchResult> rs = m.results();
List<String> numbers = rs.map(MatchResult::group).toList();

A MatchResult has the same methods as Matcher for getting the results of a match, such as group.

The code above computes a list of all the telephone numbers in someBigBlobOfText.
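Since MatchResult also supports numbered groups, the same stream can pull out just one part of each match. The sketch below collects the distinct area codes in a text. It uses Collectors.toList() so it runs on Java 9 and later; the areaCodes name and the sample input are assumptions.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AreaCodes {
    private static final Pattern PHONE = Pattern.compile("(\\d{3})-(\\d{3})-(\\d{4})");

    // Hypothetical helper: distinct area codes, in order of first appearance.
    static List<String> areaCodes(String text) {
        return PHONE.matcher(text)
            .results()             // Stream<MatchResult>, one element per match
            .map(r -> r.group(1))  // capture group 1 is the area code
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(areaCodes("510-867-5309, 415-555-0123, 510-555-0199"));
        // prints [510, 415]
    }
}
```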

And now for something really dumb …

Checking whether a unary number is prime with a regexp

First toUnary

public String toUnary(int n) {
  return "1".repeat(n);
}

Translates a Java int to a unary number represented as a String.

toUnary(5) ⟹ "11111"

Now the prime check

public boolean isPrime(String num) {
  return !num.matches(".?|(..+)\\1+");
}

If we wanted this to be more efficient, we’d use Pattern.compile to prepare the pattern once, outside the method.

But if we wanted this to be efficient we wouldn’t be doing it this way at all because it’s ridiculous.
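For the record, here is a sketch of the "prepare the pattern once" version, with comments on why the regexp works. The class name UnaryPrimes and the constant NOT_PRIME are invented for the example.

```java
import java.util.regex.Pattern;

class UnaryPrimes {
    // Compiled once. Matches unary strings that are NOT prime:
    //   .?        the empty string or a single character (0 or 1)
    //   (..+)\1+  a block of two or more characters, repeated at least once more,
    //             so the length is a product of two numbers that are both >= 2
    private static final Pattern NOT_PRIME = Pattern.compile(".?|(..+)\\1+");

    static String toUnary(int n) {
        return "1".repeat(n);
    }

    static boolean isPrime(String num) {
        return !NOT_PRIME.matcher(num).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPrime(toUnary(7))); // true
        System.out.println(isPrime(toUnary(9))); // false: "111" repeated three times
        System.out.println(isPrime(toUnary(1))); // false: handled by .?
    }
}
```

The backreference \1 is what makes this work: it has to repeat exactly the text that group 1 captured, so the engine ends up trying every possible divisor length.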

Use it

jshell> IntStream.range(0, 20)
   ...>   .filter(n -> isPrime(toUnary(n)))
   ...>   .forEach(System.out::println)
2
3
5
7
11
13
17
19