An ERE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are as follows:

.[\(

The <period>, <left-square-bracket>, <backslash>, and <left-parenthesis> shall be special except when used in a bracket expression (see RE Bracket Expression). Outside a bracket expression, a <left-parenthesis> immediately followed by a <right-parenthesis> produces undefined results.

What you are seeing is the result of invoking undefined behaviour - anything goes.

If you want reliable, portable results, you will have to eliminate the empty '()' notations.
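As a rough illustration of eliminating the empty '()', the sketch below assumes the intent of (()|abc)xyz was "xyz, optionally preceded by abc" and rewrites it as (abc)?xyz. The helper name match_optional_abc is hypothetical, not something from the question.

```c
#include <regex.h>

/* Hypothetical helper: matches `text` against (abc)?xyz, a rewrite of
 * (()|abc)xyz that avoids the undefined empty group.
 * Returns 1 on a match, 0 on no match, -1 if the pattern fails to compile.
 * On a match, m[0] is the whole match and m[1] the optional "abc" group
 * (rm_so == -1 when the group did not take part in the match). */
int match_optional_abc(const char *text, regmatch_t m[2])
{
    regex_t re;
    int rc;

    if (regcomp(&re, "(abc)?xyz", REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    return rc == 0;
}
```

With this pattern, "abcxyz" should report the whole match as [0,6) and the group as [0,3), while "xyz" alone still matches with the group unset.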

Yeah, I think the best choice is to avoid using (). Although my system does define the behavior I wanted in its re_format(7) man page, the thing to do is stick to POSIX. Thanks for digging up the reference.

c - expected behavior of posix extended regex: (()|abc)xyz - Stack Ove...

c regex posix
If you iterate over all matches and don't get both [3,6) and [0,6), then there's a bug. I'm not sure what order POSIX mandates for returning matches.

Iterating over all matches gives me [3,6), [3,3), and [3,3). The first one is the match for the regex as a whole according to the regexec man page.

c - expected behavior of posix extended regex: (()|abc)xyz - Stack Ove...

c regex posix
Ok, did it with libpcre:

#include <pcre.h>
#include <locale.h>

....

        const char *error;
        int   erroffset;
        pcre *re;
        int   rc;
        int   i;
        int   ovector[100];
        char *regex = "([a-zA-Z]{18,20})";
        re = pcre_compile (regex,          /* the pattern */
                        PCRE_MULTILINE|PCRE_DOTALL|PCRE_NEWLINE_ANYCRLF,
                        &error,         /* for error message */
                        &erroffset,     /* for error offset */
                        0);             /* use default character tables */
        if (!re)
        {
                printf("pcre_compile failed (offset: %d), %s\n", erroffset, error);
                return -1;
        }

....

                if (ret > 0) {
                        //
                        unsigned int offset = 0;
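                        /* the last argument is the number of ints in ovector, not its size in bytes */
                        while (offset < sizeof(page) && (rc = pcre_exec(re, 0, page, sizeof(page), offset, 0, ovector, sizeof(ovector) / sizeof(ovector[0]))) >= 0)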
                        {
                                for(i = 0; i < rc; ++i)
                                {
                                        printf("%.*s\n", ovector[2*i+1] - ovector[2*i], page + ovector[2*i]);
                                }
                                offset = ovector[1];
                        }
                        //
                }

regex - Linux posix C regexec() not returning all matches - Stack Over...

c regex linux posix
Why is matches not filled in at position 1?

regex_t a;
    regcomp(&a,"brasil",REG_ICASE);

    regmatch_t matches[2];
    size_t nmatch = 2;
    regexec(&a,"brasil brasil",nmatch,matches,0);

    int x;
    for(x=0;x<2;x++)
            printf("%i\n",matches[x].rm_so);

Please post this as a new question rather than tagging it onto a previous question.

regex - Posix regular expression in C - Stack Overflow

c regex linux gcc posix
regexec performs a regex match. Once a match has been found regexec will return zero (i.e. successful match). The parameter pmatch will contain information about that one match. The first array index (i.e. zero) will contain the entire match, subsequent array indices contain information about capture groups/sub-expressions.

const char* pattern = "(\\w+) (\\w+)";
num 0: 'hello world'  - entire match
num 1: 'hello'        - capture group 1
num 2: 'world'        - capture group 2

In most regex environments the behaviour you seek could have been obtained with the global modifier, /g. regexec does not provide this as a flag, nor does it support modifiers at all. You will therefore have to call regexec in a loop for as long as it returns zero, starting each search at the end offset of the previous match, to get all matches.

The global modifier is also not available using the PCRE library (famous regex C library). The PCRE man pages have this to say about it:

By calling pcre_exec() multiple times with appropriate arguments, you can mimic Perl's /g option
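The loop described above might look roughly like the following for POSIX regexec. This is a minimal sketch, not a definitive implementation: the helper name find_all and its out/max parameters are assumptions made for illustration, and only the whole-match span of each hit is recorded.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: calls regexec() repeatedly, restarting each search
 * at the end offset of the previous match. Stores up to `max` whole-match
 * spans (offsets relative to the start of `text`) in `out` and returns how
 * many were stored, or -1 if the pattern fails to compile. */
int find_all(const char *pattern, const char *text, regmatch_t *out, int max)
{
    regex_t re;
    regmatch_t m;
    size_t pos = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* REG_NOTBOL: after the first search, text + pos is no longer the
     * beginning of a line, so ^ must not match there. */
    while (count < max &&
           regexec(&re, text + pos, 1, &m, pos ? REG_NOTBOL : 0) == 0) {
        out[count].rm_so = (regoff_t)pos + m.rm_so;
        out[count].rm_eo = (regoff_t)pos + m.rm_eo;
        count++;

        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character to avoid looping forever. */
            if (text[pos + m.rm_eo] == '\0')
                break;
            pos += (size_t)m.rm_eo + 1;
        } else {
            pos += (size_t)m.rm_eo;
        }
    }

    regfree(&re);
    return count;
}
```

For the pattern "brasil" and the string "brasil brasil", this should record two spans, [0,6) and [7,13).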

regex - why regexec() in posix c always return the first match,how can...

c regex linux
Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have solved a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.

Possible solution to the harder problem follows below.

I have worked out what seems to be a solution in O(log q) space (where q is the number of question marks in the pattern, and thus q < m) and uncertain but seemingly better-than-exponential time.

First of all, a quick explanation of the problem reduction. First break the pattern at each *; it decomposes as a (possibly zero length) initial and final component, and a number of internal components flanked on both sides by a *. This means once we've determined if the initial/final components match up, we can apply the following algorithm for internal matches: Starting with the last component, search for the match in the string that starts at the latest offset. This leaves the most possible "haystack" characters free to match earlier components; if they're not all needed, it's no problem, because the fact that a * intervenes allows us to later throw away as many as needed, so it's not beneficial to try "using more ? marks" of the last component or finding an earlier occurrence of it. This procedure can then be repeated for every component. Note that here I'm strongly taking advantage of the fact that the only "repetition operator" in the fnmatch expression is the * that matches zero or more occurrences of any character. The same reduction would not work with regular expressions.

With that out of the way, I began looking for how to match a single component efficiently. I'm allowing a time factor of n, so that means it's okay to start trying at every possible position in the string, and give up and move to the next position if we fail. This is the general procedure we'll take (no Boyer-Moore-like tricks yet; perhaps they can be brought in later).

For a given component (which contains no *, only literal characters, brackets that match exactly one character from a given set, and ?), it has a minimum and maximum length string it could match. The minimum is the length if you omit all ? characters and count bracket expressions as one character, and the maximum is the length if you include ? characters. At each position, we will try each possible length the pattern component could match. This means we perform q+1 trials. For the following explanation, assume the length remains fixed (it's the outermost loop, outside the recursion that's about to be introduced). This also fixes a length (in characters) from the string that we will be comparing to the pattern at this point.

Now here's the fun part. I don't want to iterate over all possible combinations of which ? characters do/don't get used. The iterator is too big to store. So I cheat. I break the pattern component into two "halves", L and R, where each contains half of the ? characters. Then I simply iterate over all the possibilities of how many ? characters are used in L (from 0 to the total number that will be used based on the length that was fixed above) and then the number of ? characters used in R is determined as well. This also partitions the string we're trying to match into part that will be matched against pattern L and pattern R.

Now we've reduced the problem of checking if a pattern component with q ? characters matches a particular fixed-length string to two instances of checking if a pattern component with q/2 ? characters matches a particular smaller fixed-length string. Apply recursion. And since each step halves the number of ? characters involved, the number of levels of recursion is bounded by log q.
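For the original (easier) problem, where ? matches exactly one character, the "a * lets us throw away as many characters as needed" observation is enough for a simple O(nm)-time, O(1)-space matcher. The sketch below is only an illustration under stated assumptions: it handles literal characters, ? and * but not bracket expressions or backslash escapes, it scans forward and commits each component at its earliest position (the mirror image of the latest-offset search described above), and the name wild_match is hypothetical.

```c
/* Hypothetical helper: returns 1 if `str` matches `pat`, else 0.
 * Supports only literals, '?' (exactly one character) and '*'
 * (zero or more characters). Uses two saved pointers, so O(1) space. */
int wild_match(const char *pat, const char *str)
{
    const char *star = 0;    /* pattern position just after the last '*' seen */
    const char *resume = 0;  /* string position that '*' currently extends to */

    while (*str) {
        if (*pat == '*') {
            /* Remember where to restart; start with '*' matching nothing. */
            star = ++pat;
            resume = str;
        } else if (*pat == '?' || *pat == *str) {
            pat++;
            str++;
        } else if (star) {
            /* Mismatch: let the last '*' absorb one more character. */
            pat = star;
            str = ++resume;
        } else {
            return 0;
        }
    }

    /* Only trailing '*'s may remain in the pattern. */
    while (*pat == '*')
        pat++;
    return *pat == '\0';
}
```

Only the most recent * ever needs to be revisited, because any characters an earlier * could have absorbed can be absorbed by the later one instead; that is why no recursion or stack is required.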

c - Is there a known O(nm)-time/O(1)-space algorithm for POSIX filenam...

c regex string pattern-matching substring
You need to pass in a set of regmatch_ts that the regex can fill with the indices of the matches. Try the below program with a single command line argument (the string to test).

Once you have the indices of the matches, it should be fairly easy to pull out what you're after. (Note: matches[0] will be the match of the entire expression, so the subexpressions start at matches[1].)

#include <stdlib.h>
#include <stdio.h>
#include <regex.h>

int main(int argc, char* argv[])
{
    const char* pattern = "{{( )*(([[:alnum:]]+\\.)*)?[[:alnum:]]+( )*}}";
    regex_t rex;
    int rc;

    if (argc < 2) {
        fprintf(stderr, "usage: %s string\n", argv[0]);
        return EXIT_FAILURE;
    }

    if ((rc = regcomp(&rex, pattern, REG_EXTENDED))) {
        fprintf(stderr, "error %d compiling regex\n", rc);
        /* retrieve error here with regerror */
        return EXIT_FAILURE;
    }

    /* one slot for the whole match plus one per parenthesized subexpression */
    size_t nmatch = rex.re_nsub + 1;
    regmatch_t* matches = malloc(sizeof(regmatch_t) * nmatch);
    if (matches == NULL) {
        regfree(&rex);
        return EXIT_FAILURE;
    }

    if ((rc = regexec(&rex, argv[1], nmatch, matches, 0))) {
        printf("no match\n");
        /* error or no match */
    } else {
        /* subexpressions that did not take part report -1 for both offsets */
        for (size_t i = 0; i < nmatch; ++i) {
            printf("match %zu from index %d to %d: ", i,
                   (int)matches[i].rm_so, (int)matches[i].rm_eo);
            for (regoff_t j = matches[i].rm_so; j < matches[i].rm_eo; ++j) {
                printf("%c", argv[1][j]);
            }
            printf("\n");
        }
    }

    free(matches);
    regfree(&rex);

    return 0;
}

How can I easily get regex selections in C? - Stack Overflow

c regex unix posix
REGULAR EXPRESSIONS
       A  regular  expression  is  a  pattern that describes a set of strings.  Regular expressions are constructed
       analogously to arithmetic expressions, by using various operators to combine smaller expressions.

       grep understands two different versions of regular expression syntax: "basic" and "extended."  In  GNU grep,
       there  is  no  difference  in  available functionality using either syntax.  In other implementations, basic
       regular expressions are less powerful.  The following description applies to extended  regular  expressions;
       differences for basic regular expressions are summarized afterwards.

       The fundamental building blocks are the regular expressions that match a single character.  Most characters,
       including all letters and digits, are regular expressions that match themselves.   Any  meta-character  with
       special meaning may be quoted by preceding it with a backslash.

       The period . matches any single character.

   Character Classes and Bracket Expressions
       A  bracket  expression is a list of characters enclosed by [ and ].  It matches any single character in that
       list; if the first character of the list is the caret ^ then it matches any character not in the list.   For
       example, the regular expression [0123456789] matches any single digit.

       Within  a  bracket  expression,  a  range  expression  consists of two characters separated by a hyphen.  It
       matches any single character that sorts between the two characters, inclusive, using the locale's  collating
       sequence  and  character  set.   For  example, in the default C locale, [a-d] is equivalent to [abcd].  Many
       locales sort characters in dictionary order, and in these locales  [a-d]  is  typically  not  equivalent  to
       [abcd];  it  might  be  equivalent  to  [aBbCcDd], for example.  To obtain the traditional interpretation of
       bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.

       Finally, certain named classes of characters are predefined within bracket expressions, as  follows.   Their
       names  are  self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
       [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].  For  example,  [[:alnum:]]  means  [0-9A-Za-z],
       except  the  latter  form  depends upon the C locale and the ASCII character encoding, whereas the former is
       independent of locale and character set.  (Note that the brackets in these  class  names  are  part  of  the
       symbolic  names,  and must be included in addition to the brackets delimiting the bracket expression.)  Most
       meta-characters lose their special meaning inside bracket expressions.  To include  a  literal  ]  place  it
       first  in  the  list.  Similarly, to include a literal ^ place it anywhere but first.  Finally, to include a
       literal - place it last.

   Anchoring
       The caret ^ and the dollar sign $ are meta-characters that  respectively  match  the  empty  string  at  the
       beginning and end of a line.

   The Backslash Character and Special Expressions
       The symbols \< and \> respectively match the empty string at the beginning and end of a word.  The symbol \b
       matches the empty string at the edge of a word, and \B matches the empty string provided  it's  not  at  the
       edge of a word.  The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].

   Repetition
       A regular expression may be followed by one of several repetition operators:
       ?      The preceding item is optional and matched at most once.
       *      The preceding item will be matched zero or more times.
       +      The preceding item will be matched one or more times.
       {n}    The preceding item is matched exactly n times.
       {n,}   The preceding item is matched n or more times.
       {,m}   The preceding item is matched at most m times.
       {n,m}  The preceding item is matched at least n times, but not more than m times.

   Concatenation
       Two  regular  expressions may be concatenated; the resulting regular expression matches any string formed by
       concatenating two substrings that respectively match the concatenated expressions.

   Alternation
       Two regular expressions may be joined by the infix operator |; the resulting regular expression matches  any
       string matching either alternate expression.

   Precedence
       Repetition  takes  precedence  over concatenation, which in turn takes precedence over alternation.  A whole
       expression may be enclosed in parentheses to override these precedence rules and form a subexpression.

   Back References and Subexpressions
       The back-reference \n, where n is a single digit, matches  the  substring  previously  matched  by  the  nth
       parenthesized subexpression of the regular expression.

   Basic vs Extended Regular Expressions
       In  basic  regular  expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
       use the backslashed versions \?, \+, \{, \|, \(, and \).

       Traditional egrep did not support the { meta-character, and some egrep implementations support  \{  instead,
       so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.

       GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
       of an invalid interval specification.  For example, the command grep -E '{1' searches for the  two-character
       string {1 instead of reporting a syntax error in the regular expression.  POSIX.2 allows this behavior as an
       extension, but portable scripts should avoid it.
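
The "Basic vs Extended Regular Expressions" difference above also applies to the C regcomp() interface, where REG_EXTENDED selects the extended syntax. The following is a small sketch for experimenting with it; the helper name re_matches is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` with the given cflags
 * (0 for basic syntax, REG_EXTENDED for extended) and reports whether it
 * matches anywhere in `text`: 1 = match, 0 = no match, -1 = compile error. */
int re_matches(const char *pattern, int cflags, const char *text)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, re_matches("(ab){2}", REG_EXTENDED, "abab") and re_matches("\\(ab\\)\\{2\\}", 0, "abab") should both report a match, whereas re_matches("(ab){2}", 0, "abab") should not, because in basic syntax the unescaped parentheses and braces are ordinary characters.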

regex - Regular expressions documentation while using grep - Stack Ove...

regex linux bash grep
You're confused about what first and second mean. In this expression:

"([^$]*(\\$[A-Za-z][A-Za-z0-9_]*))+"
 ^_______________________________^    this part

is the first parenthesized subexpression and

"([^$]*(\\$[A-Za-z][A-Za-z0-9_]*))+"
       ^________________________^    this part

is the second. If a parenthesized subexpression gets used more than once as part of a *, ?, +, or {} repetition operator, it's the last match that counts.

If you want to match an arbitrary number of instances, then rather than using the + on the end of your regex, simply call regexec multiple times, using the ending offset of the previous run as your new starting point.
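
To see the "last match counts" rule in isolation, here is a minimal sketch with a deliberately simpler pattern than the one in the question; the helper name match_one_group is hypothetical.

```c
#include <regex.h>

/* Hypothetical helper: runs an ERE containing one capture group against
 * `text`. m[0] receives the whole match and m[1] the group.
 * Returns 1 on a match, 0 on no match, -1 if the pattern fails to compile. */
int match_one_group(const char *pattern, const char *text, regmatch_t m[2])
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 2, m, 0);
    regfree(&re);
    return rc == 0;
}
```

Matching "(ab|cd)+" against "abcd" should give a whole match of [0,4) but a group span of only [2,4): the group ran twice, and only its final iteration ("cd") is reported.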

regex - How to get offset of all repeated matches in POSIX C regexec()...

c regex posix
Regexes are as greedy as possible without being too greedy. Had the left group been as greedy as you expect, the group that matches "identification division" would have been unable to match, and the regex would have erroneously rejected text that is clearly in the language.
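
A toy pattern shows the same effect on a small scale. This is only a sketch, and the helper name match_two_groups is hypothetical.

```c
#include <regex.h>

/* Hypothetical helper: runs an ERE containing two capture groups against
 * `text`. m[0] is the whole match, m[1] and m[2] are the groups.
 * Returns 1 on a match, 0 on no match, -1 if the pattern fails to compile. */
int match_two_groups(const char *pattern, const char *text, regmatch_t m[3])
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 3, m, 0);
    regfree(&re);
    return rc == 0;
}
```

With "(a*)(ab)" against "aaab", the first group cannot swallow all three a's, because the second group would then have nothing left to match. The first group should end up with [0,2) and the second with [2,4).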

c - Posix regex capture group matching sequence - Stack Overflow

c regex
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <regex.h>

int main(void) {
    const char tests[2][4] = {"-l4", "-lm"};
    const char match[] = "-l[[:digit:]]+";
    regex_t rmatch;

    if ( regcomp(&rmatch, match, REG_EXTENDED) != 0 ) {
        /* regcomp() reports errors via its return value, not errno */
        fprintf(stderr, "Error compiling regex\n");
        return EXIT_FAILURE;
    }

    for ( int i = 0; i < 2; ++i ) {
        if ( regexec(&rmatch, tests[i], 0, NULL, 0) != 0 ) {
            printf("No match for '%s'.\n", tests[i]);
        } else {
            printf("Matched '%s'.\n", tests[i]);
        }
    }

    regfree(&rmatch);
    return 0;
}

EDIT: In the code you posted, you've got a couple of problems:

if(regcomp(&regex,"-l[[digit:]]+",0)){

should be:

if( regcomp(&regex, "-l[[:digit:]]+", REG_EXTENDED) ) {

The character class is spelled [:digit:], and + is only a repetition operator in extended regular expressions.

Your segmentation fault is actually nothing to do with your regular expressions, and comes from calling this:

fclose(f);

when on an execution path where you never successfully opened a file. You should change:

FILE *f;

to:

FILE *f = NULL;

and change:

fclose(f);

to:

if ( f ) {
    fclose(f);
}

Making yourself familiar with gdb will go a long, long way towards getting you able to track these things down yourself.

Here's a modified version of your own code that'll work and includes some basic error-checking:

paul@local:~/src/c/scratch$ ./regex2 -l4
Matched '-l4' to regex
paul@local:~/src/c/scratch$ ./regex2 -r fakefile
argv[1] is -r
Couldn't open file fakefile
paul@local:~/src/c/scratch$ ./regex2 -tribbles
Couldn't open file -tribbles
paul@local:~/src/c/scratch$ ./regex2 testfile
This is a test.
paul@local:~/src/c/scratch$ ./regex2 -r testfile
argv[1] is -r

.tset a si sihTpaul@local:~/src/c/scratch$

Note that when you're using getc() and friends, they use ints, not chars. This is necessary in order to be able to store EOF.

EDIT 2: Per the question in your comment, you need to do four things to match a sub-group, in this case, the numeric part of the match.

  • Set up an array of type regmatch_t. You'll need at least two elements, since the first will match the entire regex, and you'll need at least a second for one sub-group. In the code below, I've added:

#define MAX_MATCHES 10
regmatch_t m_group[MAX_MATCHES];

  • Put parentheses around the part of the regex you want to extract. In the code below, I've changed:

"-l[[:digit:]]+"

to:

"-l([[:digit:]]+)"

  • Pass your regmatch_t array to regexec() when you call it, along with the size. In the code below, I've changed the call to:

} else if ( regexec(&rmatch, argv[1], MAX_MATCHES, m_group, 0) == 0 ) {

  • Cycle through the array and deal with each match. Every time the rm_so member of a regmatch_t array element is not -1, you have a match. All I'm doing here is copying them to a buffer and printing them out:

} else if ( regexec(&rmatch, argv[1], MAX_MATCHES, m_group, 0) == 0 ) {
    printf("Matched '%s' to regex\n", argv[1]);
    for ( int i = 0; i < MAX_MATCHES && m_group[i].rm_so != -1; ++i ) {
        char buffer[1000] = {0};
        char * match_start = &argv[1][m_group[i].rm_so];
        size_t match_size = m_group[i].rm_eo - m_group[i].rm_so;
        size_t match_len = match_size > 999 ? 999 : match_size;
        strncpy(buffer, match_start, match_len);
        printf("Matched group %d was '%s'\n", i, buffer);
    }
}

#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>

#define MAX_MATCHES 10

void reversetext(FILE * f);

int main(int argc, char *argv[]) {
    regex_t rmatch;
    regmatch_t m_group[MAX_MATCHES];
    FILE *f = NULL;
    int c;

    if ( argc < 2 ) {
        printf("You need to enter at least one command line argument.\n");
        return EXIT_FAILURE;
    }

    if ( regcomp(&rmatch, "-l([[:digit:]]+)", REG_EXTENDED) ) {
        printf("Could not compile regex.\n");
        return EXIT_FAILURE;
    }

    if ( strcmp(argv[1], "-r") == 0 && argc > 2) {
        printf("argv[1] is -r\n");
        if ( (f = fopen(argv[2], "r")) == NULL ) {
            fprintf(stderr, "Couldn't open file %s\n", argv[2]);
            return EXIT_FAILURE;
        }
        reversetext(f);
    } else if ( regexec(&rmatch, argv[1], MAX_MATCHES, m_group, 0) == 0 ) {
        printf("Matched '%s' to regex\n", argv[1]);
        for ( int i = 0; i < MAX_MATCHES && m_group[i].rm_so != -1; ++i ) {
            char buffer[1000] = {0};
            char * match_start = &argv[1][m_group[i].rm_so];
            size_t match_size = m_group[i].rm_eo - m_group[i].rm_so;
            size_t match_len = match_size > 999 ? 999 : match_size;
            strncpy(buffer, match_start, match_len);
            printf("Matched group %d was '%s'\n", i, buffer);
        }
    }  else {
        if ( (f = fopen(argv[1], "r")) == NULL ) {
            fprintf(stderr, "Couldn't open file %s\n", argv[1]);
            return EXIT_FAILURE;
        }

        while ( (c = getc(f)) != EOF) {
            printf("%c", c);
        }
    }

    if ( f ) {
        fclose(f);
    }

    regfree(&rmatch);
    return EXIT_SUCCESS;
}

void reversetext(FILE * f) {
    int c = getc(f);
    if (c == EOF) {
        return;
    }

    reversetext(f);
    printf("%c", c);
}
paul@local:~/src/c/scratch$ ./regex2 -l4
Matched '-l4' to regex
Matched group 0 was '-l4'
Matched group 1 was '4'
paul@local:~/src/c/scratch$

if you notice my posted code, it's the same I'm doing, but instead of tests[i], I am placing argv[1] which should be -l(i) but nothing is printing out..instead an error

@Atieh: It's not the same - you're not passing REG_EXTENDED to regcomp() in the code you posted.

it worked! thank you so much! been researching this for the past 5 hours. Allow me to go on with acing this assignment :)

@Atieh: No problem. I posted a working version of your code in my edit, along with some proper error checking.

regex - Dealing with POSIX in C - Error message: Segmentation Fault (c...

c regex unix terminal posix