Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

regex - Extracting breakpoints with intervals closed on the left

I'm looking at the examples for cut() (run example(cut)), specifically this part:

cut> aaa <- c(1,2,3,4,5,2,3,4,5,6,7)

cut> cut(aaa, 3)
[1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3]
[7] (3,5]     (3,5]     (3,5]     (5,7.01]  (5,7.01] 
Levels: (0.994,3] (3,5] (5,7.01]

cut> cut(aaa, 3, dig.lab = 4, ordered = TRUE)
[1] (0.994,2.998] (0.994,2.998] (2.998,5.002] (2.998,5.002]
[5] (2.998,5.002] (0.994,2.998] (2.998,5.002] (2.998,5.002]
[9] (2.998,5.002] (5.002,7.006] (5.002,7.006]
Levels: (0.994,2.998] < (2.998,5.002] < (5.002,7.006]

cut> ## one way to extract the breakpoints
cut> labs <- levels(cut(aaa, 3))

cut> cbind(lower = as.numeric( sub("\\((.+),.*", "\\1", labs) ),
cut+       upper = as.numeric( sub("[^,]*,([^]]*)\\]", "\\1", labs) ))
     lower upper
[1,] 0.994  3.00
[2,] 3.000  5.00
[3,] 5.000  7.01

In that example the intervals are closed on the right, and the last lines show one way to extract the breakpoints with cbind().

Now suppose my data is cut with intervals closed on the left instead:

cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)

How can I extract the breakpoints in that case, ideally with the same cbind() approach? (Other approaches are welcome too.)

1 Reply

Use a pattern like the following with gsub() instead of sub(), so that every bracket and parenthesis is removed: "\\[|\\]|\\(|\\)".

An example.

out <- levels(cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE))
gsub("\\[|\\]|\\(|\\)", "", out)
# [1] "0.994,2.998" "2.998,5.002" "5.002,7.006"

And, here's a quick way to read that data in:

read.csv(text = gsub("\\[|\\]|\\(|\\)", "", out), header = FALSE)
#      V1    V2
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006

FYI: The same pattern would work whether the intervals are closed on the left or on the right. Using your original example:

labs <- levels(cut(aaa, 3))
labs
# [1] "(0.994,3]" "(3,5]"     "(5,7.01]" 
read.csv(text = gsub("\\[|\\]|\\(|\\)", "", labs), header = FALSE)
#      V1   V2
# 1 0.994 3.00
# 2 3.000 5.00
# 3 5.000 7.01
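
If you need this more than once, the pattern and the read.csv() call can be wrapped in a small helper. The sketch below is only an illustration: extract_breaks is a made-up name, not part of base R, and it assumes the default labels produced by cut() (two numbers separated by a comma, with one bracket or parenthesis on each side). It uses the character class "[][()]", which should match the same four characters as the alternation above.

```r
aaa <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)

# Hypothetical helper: turn cut() labels into a data frame of bounds.
extract_breaks <- function(labs) {
  # Remove every bracket and parenthesis, leaving "lower,upper".
  stripped <- gsub("[][()]", "", labs)
  # Parse the comma-separated text and name the two columns.
  read.csv(text = stripped, header = FALSE, col.names = c("lower", "upper"))
}

# Works the same for left-closed and right-closed intervals.
left_closed  <- extract_breaks(levels(cut(aaa, 3, right = FALSE)))
right_closed <- extract_breaks(levels(cut(aaa, 3)))
```

Both calls return a data frame with numeric columns lower and upper, one row per interval.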

As for alternatives, since you just need to strip out the first and last character before you can use read.csv, you can also easily use substr without having to fuss with regular expressions (if that's not your thing):

substr(labs, 2, nchar(labs)-1)
# [1] "0.994,3" "3,5"     "5,7.01" 
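
The substr() result still has to be split on the comma. Besides read.csv(), one possible follow-up (a sketch, not the only way) is strsplit() plus as.numeric(), which gives a numeric matrix shaped like the cbind() output in the question. The names inner, parts and bounds are just illustrative.

```r
aaa  <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)
labs <- levels(cut(aaa, 3, right = FALSE))

# Strip the first and last character, then split each label on the comma.
inner <- substr(labs, 2, nchar(labs) - 1)
parts <- strsplit(inner, ",", fixed = TRUE)

# One row per interval, with columns named as in the cbind() example.
bounds <- matrix(as.numeric(unlist(parts)), ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("lower", "upper")))
bounds
```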

Update: A totally different alternative

R has to calculate these values inside cut() in order to build the labels you see, so it is not too difficult to modify the function to return them as well.

Looking at the code for cut.default, you'll find the following as the last few lines:

if (codes.only) 
    code
else factor(code, seq_along(labels), labels, ordered = ordered_result)

It's easy to change those last few lines so that the function returns a list: the usual output of cut as the first item, and the calculated ranges as the second (taken from the values computed inside the function, not extracted from the pasted-together factor labels).

For instance, in the Gist I've posted at this link, I've changed those lines as follows:

if (codes.only) 
  FIN <- code
else FIN <- factor(code, seq_along(labels), labels, ordered = ordered_result)
list(output = FIN, ranges = data.frame(lower = ch.br[-nb], upper = ch.br[-1L]))

Now, compare cut with the modified version (called CUT here):

cut(aaa, 3)
#  [1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3] (3,5]     (3,5]    
#  [9] (3,5]     (5,7.01]  (5,7.01] 
# Levels: (0.994,3] (3,5] (5,7.01]
CUT(aaa, 3)
# $output
# [1] (0.994,3] (0.994,3] (3,5]     (3,5]     (3,5]     (0.994,3] (3,5]     (3,5]    
# [9] (3,5]     (5,7.01]  (5,7.01] 
# Levels: (0.994,3] (3,5] (5,7.01]
# 
# $ranges
#   lower upper
# 1 0.994     3
# 2     3     5
# 3     5  7.01

And, right = FALSE:

cut(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
#  [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
#  [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
CUT(aaa, 3, dig.lab = 4, ordered = TRUE, right = FALSE)
# $output
#  [1] [0.994,2.998) [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002)
#  [6] [0.994,2.998) [2.998,5.002) [2.998,5.002) [2.998,5.002) [5.002,7.006)
# [11] [5.002,7.006)
# Levels: [0.994,2.998) < [2.998,5.002) < [5.002,7.006)
# 
# $ranges
#   lower upper
# 1 0.994 2.998
# 2 2.998 5.002
# 3 5.002 7.006
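
If editing cut.default feels too invasive, a different technique (not what the answer above does) is to choose the breakpoints yourself and pass them to cut() as a vector, so there is nothing to extract afterwards. This sketch assumes plain equal-width bins over the range of the data, so the labels will not match cut(aaa, 3) exactly, because cut() moves the outer limits slightly outwards when it is given only a number of intervals. The names brks, binned and ranges are illustrative.

```r
aaa <- c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 7)

# Four equally spaced breakpoints give three intervals.
brks <- seq(min(aaa), max(aaa), length.out = 4)

# Left-closed intervals; include.lowest = TRUE also closes the last
# interval on the right, so max(aaa) still falls into a bin.
binned <- cut(aaa, breaks = brks, right = FALSE, include.lowest = TRUE)

# The bounds come straight from brks, with no string parsing.
ranges <- data.frame(lower = head(brks, -1), upper = tail(brks, -1))
```

The trade-off is that you control the breakpoints explicitly instead of relying on the ones cut() would compute.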
