| Submitter | Austin Clements |
|---|---|
| Date | 2011-11-08 03:55:23 |
| Message ID | <1320724523-23568-1-git-send-email-amdragon@mit.edu> |
| Download | mbox | patch |
| Permalink | /patch/1473/ |
| State | New |
| Headers | show |
Comments
Hi Austin. On Mon, 7 Nov 2011 22:55:23 -0500, Austin Clements <amdragon@MIT.EDU> wrote: > This optimizes the user's tagging query to exclude messages that won't > be affected by the tagging operation, saving computation and IO for > redundant tagging operations. > > For example, > notmuch tag +notmuch to:notmuch@notmuchmail.org > will now use the query > ( to:notmuch@notmuchmail.org ) and (not tag:"notmuch") > > In the past, we've often suggested that people do this exact > transformation by hand for slow tagging operations. This makes that > unnecessary. Thanks! This is a very useful optimization. Does it work for multiple tags and tag removal? I.e.: notmuch tag -inbox -unread +sent from:dmitry.kurochkin@gmail.com can be converted to: notmuch tag -inbox -unread +sent from:dmitry.kurochkin@gmail.com and (tag:inbox or tag:unread or (not tag:sent)) Regards, Dmitry > --- > I was about to implement this optimization in my initial tagging > script, but then I figured, why not just do it in notmuch so we can > stop telling people to do this by hand? > > NEWS | 9 ++++++ > notmuch-tag.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 85 insertions(+), 0 deletions(-) > > diff --git a/NEWS b/NEWS > index e00452a..9ca5e0c 100644 > --- a/NEWS > +++ b/NEWS > @@ -16,6 +16,15 @@ Add search terms to "notmuch dump" > search/show/tag. The output file argument of dump is deprecated in > favour of using stdout. > > +Optimizations > +------------- > + > +Automatic tag query optimization > + > + "notmuch tag" now automatically optimizes the user's query to > + exclude messages whose tags won't change. In the past, we've > + suggested that people do this by hand; this is no longer necessary. > + > Notmuch 0.9 (2011-10-01) > ======================== > > diff --git a/notmuch-tag.c b/notmuch-tag.c > index dded39e..62c4bf1 100644 > --- a/notmuch-tag.c > +++ b/notmuch-tag.c > @@ -30,6 +30,76 @@ handle_sigint (unused (int sig)) > interrupted = 1; > } > > +static char * > +_escape_tag (char *buf, const char *tag) > +{ > + const char *in = tag; > + char *out = buf; > + /* Boolean terms surrounded by double quotes can contain any > + * character. Double quotes are quoted by doubling them. */ > + *(out++) = '"'; > + while (*in) { > + if (*in == '"') > + *(out++) = '"'; > + *(out++) = *(in++); > + } > + *(out++) = '"'; > + *out = 0; > + return buf; > +} > + > +static char * > +_optimize_tag_query (void *ctx, const char *orig_query_string, char *argv[], > + int *add_tags, int add_tags_count, > + int *remove_tags, int remove_tags_count) > +{ > + /* This is subtler than it looks. Xapian ignores the '-' operator > + * at the beginning both queries and parenthesized groups and, > + * furthermore, the presence of a '-' operator at the beginning of > + * a group can inhibit parsing of the previous operator. Hence, > + * the user-provided query MUST appear first, but it is safe to > + * parenthesize and the exclusion part of the query must not use > + * the '-' operator (though the NOT operator is fine). */ > + > + char *escaped, *query_string; > + const char *join = ""; > + int i; > + unsigned int max_tag_len = 0; > + > + /* Allocate a buffer for escaping tags. */ > + for (i = 0; i < add_tags_count; i++) > + if (strlen (argv[add_tags[i]] + 1) > max_tag_len) > + max_tag_len = strlen (argv[add_tags[i]] + 1); > + for (i = 0; i < remove_tags_count; i++) > + if (strlen (argv[remove_tags[i]] + 1) > max_tag_len) > + max_tag_len = strlen (argv[remove_tags[i]] + 1); > + escaped = talloc_array(ctx, char, max_tag_len * 2 + 3); > + > + /* Build the new query string */ > + if (strcmp (orig_query_string, "*") == 0) > + query_string = talloc_strdup (ctx, "("); > + else > + query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string); > + > + for (i = 0; i < add_tags_count; i++) { > + query_string = talloc_asprintf_append_buffer ( > + query_string, "%snot tag:%s", join, > + _escape_tag (escaped, argv[add_tags[i]] + 1)); > + join = " or "; > + } > + for (i = 0; i < remove_tags_count; i++) { > + query_string = talloc_asprintf_append_buffer ( > + query_string, "%stag:%s", join, > + _escape_tag (escaped, argv[remove_tags[i]] + 1)); > + join = " or "; > + } > + > + query_string = talloc_strdup_append_buffer (query_string, ")"); > + > + talloc_free (escaped); > + return query_string; > +} > + > int > notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > { > @@ -93,6 +163,12 @@ notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > return 1; > } > > + /* Optimize the query so it excludes messages that already have > + * the specified set of tags. */ > + query_string = _optimize_tag_query (ctx, query_string, argv, > + add_tags, add_tags_count, > + remove_tags, remove_tags_count); > + > config = notmuch_config_open (ctx, NULL, NULL); > if (config == NULL) > return 1; > -- > 1.7.7.1 > > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch
On Mon, 7 Nov 2011 22:55:23 -0500, Austin Clements <amdragon@MIT.EDU> wrote: > This optimizes the user's tagging query to exclude messages that won't > be affected by the tagging operation, saving computation and IO for > redundant tagging operations. +1 for this!
Quoth Dmitry Kurochkin on Nov 08 at 8:34 am: > Hi Austin. > > On Mon, 7 Nov 2011 22:55:23 -0500, Austin Clements <amdragon@MIT.EDU> wrote: > > This optimizes the user's tagging query to exclude messages that won't > > be affected by the tagging operation, saving computation and IO for > > redundant tagging operations. > > > > For example, > > notmuch tag +notmuch to:notmuch@notmuchmail.org > > will now use the query > > ( to:notmuch@notmuchmail.org ) and (not tag:"notmuch") > > > > In the past, we've often suggested that people do this exact > > transformation by hand for slow tagging operations. This makes that > > unnecessary. > > Thanks! This is a very useful optimization. > > Does it work for multiple tags and tag removal? I.e.: > > notmuch tag -inbox -unread +sent from:dmitry.kurochkin@gmail.com > > can be converted to: > > notmuch tag -inbox -unread +sent from:dmitry.kurochkin@gmail.com and (tag:inbox or tag:unread or (not tag:sent)) Yep. This is pretty much exactly what it does.
FWIW, I reviewed this and didn't find any obvious problems. A few nitpicks below, though. BR, Jani. On Mon, 7 Nov 2011 22:55:23 -0500, Austin Clements <amdragon@MIT.EDU> wrote: > This optimizes the user's tagging query to exclude messages that won't > be affected by the tagging operation, saving computation and IO for > redundant tagging operations. > > For example, > notmuch tag +notmuch to:notmuch@notmuchmail.org > will now use the query > ( to:notmuch@notmuchmail.org ) and (not tag:"notmuch") > > In the past, we've often suggested that people do this exact > transformation by hand for slow tagging operations. This makes that > unnecessary. > --- > I was about to implement this optimization in my initial tagging > script, but then I figured, why not just do it in notmuch so we can > stop telling people to do this by hand? > > NEWS | 9 ++++++ > notmuch-tag.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 85 insertions(+), 0 deletions(-) > > diff --git a/NEWS b/NEWS > index e00452a..9ca5e0c 100644 > --- a/NEWS > +++ b/NEWS > @@ -16,6 +16,15 @@ Add search terms to "notmuch dump" > search/show/tag. The output file argument of dump is deprecated in > favour of using stdout. > > +Optimizations > +------------- > + > +Automatic tag query optimization > + > + "notmuch tag" now automatically optimizes the user's query to > + exclude messages whose tags won't change. In the past, we've > + suggested that people do this by hand; this is no longer necessary. > + > Notmuch 0.9 (2011-10-01) > ======================== > > diff --git a/notmuch-tag.c b/notmuch-tag.c > index dded39e..62c4bf1 100644 > --- a/notmuch-tag.c > +++ b/notmuch-tag.c > @@ -30,6 +30,76 @@ handle_sigint (unused (int sig)) > interrupted = 1; > } > > +static char * > +_escape_tag (char *buf, const char *tag) > +{ > + const char *in = tag; > + char *out = buf; > + /* Boolean terms surrounded by double quotes can contain any > + * character. Double quotes are quoted by doubling them. */ > + *(out++) = '"'; > + while (*in) { > + if (*in == '"') > + *(out++) = '"'; > + *(out++) = *(in++); > + } > + *(out++) = '"'; The parenthesis are unnecessary for *p++. > + *out = 0; > + return buf; > +} > + > +static char * > +_optimize_tag_query (void *ctx, const char *orig_query_string, char *argv[], > + int *add_tags, int add_tags_count, > + int *remove_tags, int remove_tags_count) > +{ > + /* This is subtler than it looks. Xapian ignores the '-' operator > + * at the beginning both queries and parenthesized groups and, > + * furthermore, the presence of a '-' operator at the beginning of > + * a group can inhibit parsing of the previous operator. Hence, > + * the user-provided query MUST appear first, but it is safe to > + * parenthesize and the exclusion part of the query must not use > + * the '-' operator (though the NOT operator is fine). */ > + > + char *escaped, *query_string; > + const char *join = ""; > + int i; > + unsigned int max_tag_len = 0; > + > + /* Allocate a buffer for escaping tags. */ > + for (i = 0; i < add_tags_count; i++) > + if (strlen (argv[add_tags[i]] + 1) > max_tag_len) > + max_tag_len = strlen (argv[add_tags[i]] + 1); > + for (i = 0; i < remove_tags_count; i++) > + if (strlen (argv[remove_tags[i]] + 1) > max_tag_len) > + max_tag_len = strlen (argv[remove_tags[i]] + 1); > + escaped = talloc_array(ctx, char, max_tag_len * 2 + 3); Perhaps a comment here or above _escape_tag() explaining the worst case memory consumption of strlen(tag) * 2 + 3 for a tag of "s would be in order. It's unrelated, but looking at the above also made me check something I've suspected before: notmuch allows you to have empty or zero length tags "", which is probably not intentional. There's no check for talloc failures here or below. But then there are few checks for that in the cli in general. *shrug*. > + > + /* Build the new query string */ > + if (strcmp (orig_query_string, "*") == 0) > + query_string = talloc_strdup (ctx, "("); > + else > + query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string); > + > + for (i = 0; i < add_tags_count; i++) { > + query_string = talloc_asprintf_append_buffer ( > + query_string, "%snot tag:%s", join, > + _escape_tag (escaped, argv[add_tags[i]] + 1)); > + join = " or "; > + } > + for (i = 0; i < remove_tags_count; i++) { > + query_string = talloc_asprintf_append_buffer ( > + query_string, "%stag:%s", join, > + _escape_tag (escaped, argv[remove_tags[i]] + 1)); > + join = " or "; > + } > + > + query_string = talloc_strdup_append_buffer (query_string, ")"); > + > + talloc_free (escaped); > + return query_string; > +} > + > int > notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > { > @@ -93,6 +163,12 @@ notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > return 1; > } > > + /* Optimize the query so it excludes messages that already have > + * the specified set of tags. */ > + query_string = _optimize_tag_query (ctx, query_string, argv, > + add_tags, add_tags_count, > + remove_tags, remove_tags_count); > + > config = notmuch_config_open (ctx, NULL, NULL); > if (config == NULL) > return 1; > -- > 1.7.7.1 > > _______________________________________________ > notmuch mailing list > notmuch@notmuchmail.org > http://notmuchmail.org/mailman/listinfo/notmuch
Quoth Jani Nikula on Nov 09 at 8:46 am: > > FWIW, I reviewed this and didn't find any obvious problems. A few > nitpicks below, though. > > BR, > Jani. > > On Mon, 7 Nov 2011 22:55:23 -0500, Austin Clements <amdragon@MIT.EDU> wrote: > > This optimizes the user's tagging query to exclude messages that won't > > be affected by the tagging operation, saving computation and IO for > > redundant tagging operations. > > > > For example, > > notmuch tag +notmuch to:notmuch@notmuchmail.org > > will now use the query > > ( to:notmuch@notmuchmail.org ) and (not tag:"notmuch") > > > > In the past, we've often suggested that people do this exact > > transformation by hand for slow tagging operations. This makes that > > unnecessary. > > --- > > I was about to implement this optimization in my initial tagging > > script, but then I figured, why not just do it in notmuch so we can > > stop telling people to do this by hand? > > > > NEWS | 9 ++++++ > > notmuch-tag.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 85 insertions(+), 0 deletions(-) > > > > diff --git a/NEWS b/NEWS > > index e00452a..9ca5e0c 100644 > > --- a/NEWS > > +++ b/NEWS > > @@ -16,6 +16,15 @@ Add search terms to "notmuch dump" > > search/show/tag. The output file argument of dump is deprecated in > > favour of using stdout. > > > > +Optimizations > > +------------- > > + > > +Automatic tag query optimization > > + > > + "notmuch tag" now automatically optimizes the user's query to > > + exclude messages whose tags won't change. In the past, we've > > + suggested that people do this by hand; this is no longer necessary. > > + > > Notmuch 0.9 (2011-10-01) > > ======================== > > > > diff --git a/notmuch-tag.c b/notmuch-tag.c > > index dded39e..62c4bf1 100644 > > --- a/notmuch-tag.c > > +++ b/notmuch-tag.c > > @@ -30,6 +30,76 @@ handle_sigint (unused (int sig)) > > interrupted = 1; > > } > > > > +static char * > > +_escape_tag (char *buf, const char *tag) > > +{ > > + const char *in = tag; > > + char *out = buf; > > + /* Boolean terms surrounded by double quotes can contain any > > + * character. Double quotes are quoted by doubling them. */ > > + *(out++) = '"'; > > + while (*in) { > > + if (*in == '"') > > + *(out++) = '"'; > > + *(out++) = *(in++); > > + } > > + *(out++) = '"'; > > The parenthesis are unnecessary for *p++. Removed. I put these in out of paranoia, but I suppose it wouldn't be an lvalue if it parsed differently. > > + *out = 0; > > + return buf; > > +} > > + > > +static char * > > +_optimize_tag_query (void *ctx, const char *orig_query_string, char *argv[], > > + int *add_tags, int add_tags_count, > > + int *remove_tags, int remove_tags_count) > > +{ > > + /* This is subtler than it looks. Xapian ignores the '-' operator > > + * at the beginning both queries and parenthesized groups and, > > + * furthermore, the presence of a '-' operator at the beginning of > > + * a group can inhibit parsing of the previous operator. Hence, > > + * the user-provided query MUST appear first, but it is safe to > > + * parenthesize and the exclusion part of the query must not use > > + * the '-' operator (though the NOT operator is fine). */ > > + > > + char *escaped, *query_string; > > + const char *join = ""; > > + int i; > > + unsigned int max_tag_len = 0; > > + > > + /* Allocate a buffer for escaping tags. */ > > + for (i = 0; i < add_tags_count; i++) > > + if (strlen (argv[add_tags[i]] + 1) > max_tag_len) > > + max_tag_len = strlen (argv[add_tags[i]] + 1); > > + for (i = 0; i < remove_tags_count; i++) > > + if (strlen (argv[remove_tags[i]] + 1) > max_tag_len) > > + max_tag_len = strlen (argv[remove_tags[i]] + 1); > > + escaped = talloc_array(ctx, char, max_tag_len * 2 + 3); > > Perhaps a comment here or above _escape_tag() explaining the worst case > memory consumption of strlen(tag) * 2 + 3 for a tag of "s would be in > order. Definitely. Done. > It's unrelated, but looking at the above also made me check something > I've suspected before: notmuch allows you to have empty or zero length > tags "", which is probably not intentional. > > There's no check for talloc failures here or below. But then there are > few checks for that in the cli in general. *shrug*. It's unfortunate that error handling obscures C code so much. But there's no sense in not handling errors, so I fixed this. > > + > > + /* Build the new query string */ > > + if (strcmp (orig_query_string, "*") == 0) > > + query_string = talloc_strdup (ctx, "("); > > + else > > + query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string); > > + > > + for (i = 0; i < add_tags_count; i++) { > > + query_string = talloc_asprintf_append_buffer ( > > + query_string, "%snot tag:%s", join, > > + _escape_tag (escaped, argv[add_tags[i]] + 1)); > > + join = " or "; > > + } > > + for (i = 0; i < remove_tags_count; i++) { > > + query_string = talloc_asprintf_append_buffer ( > > + query_string, "%stag:%s", join, > > + _escape_tag (escaped, argv[remove_tags[i]] + 1)); > > + join = " or "; > > + } > > + > > + query_string = talloc_strdup_append_buffer (query_string, ")"); > > + > > + talloc_free (escaped); > > + return query_string; > > +} > > + > > int > > notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > > { > > @@ -93,6 +163,12 @@ notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) > > return 1; > > } > > > > + /* Optimize the query so it excludes messages that already have > > + * the specified set of tags. */ > > + query_string = _optimize_tag_query (ctx, query_string, argv, > > + add_tags, add_tags_count, > > + remove_tags, remove_tags_count); > > + > > config = notmuch_config_open (ctx, NULL, NULL); > > if (config == NULL) > > return 1;
On Wed, 09 Nov 2011 08:46:02 +0000, Jani Nikula <jani@nikula.org> wrote: > It's unrelated, but looking at the above also made me check something > I've suspected before: notmuch allows you to have empty or zero length > tags "", which is probably not intentional. I had reported already that it is possible to add either an empty or a "\n" tag (not sure what gets added) via '+<RET>' in the emacs UI. See the newline in the tag list. Remove it again with '-<RET>' in emacs. Sebastian
Patch
diff --git a/NEWS b/NEWS index e00452a..9ca5e0c 100644 --- a/NEWS +++ b/NEWS @@ -16,6 +16,15 @@ Add search terms to "notmuch dump" search/show/tag. The output file argument of dump is deprecated in favour of using stdout. +Optimizations +------------- + +Automatic tag query optimization + + "notmuch tag" now automatically optimizes the user's query to + exclude messages whose tags won't change. In the past, we've + suggested that people do this by hand; this is no longer necessary. + Notmuch 0.9 (2011-10-01) ======================== diff --git a/notmuch-tag.c b/notmuch-tag.c index dded39e..62c4bf1 100644 --- a/notmuch-tag.c +++ b/notmuch-tag.c @@ -30,6 +30,76 @@ handle_sigint (unused (int sig)) interrupted = 1; } +static char * +_escape_tag (char *buf, const char *tag) +{ + const char *in = tag; + char *out = buf; + /* Boolean terms surrounded by double quotes can contain any + * character. Double quotes are quoted by doubling them. */ + *(out++) = '"'; + while (*in) { + if (*in == '"') + *(out++) = '"'; + *(out++) = *(in++); + } + *(out++) = '"'; + *out = 0; + return buf; +} + +static char * +_optimize_tag_query (void *ctx, const char *orig_query_string, char *argv[], + int *add_tags, int add_tags_count, + int *remove_tags, int remove_tags_count) +{ + /* This is subtler than it looks. Xapian ignores the '-' operator + * at the beginning both queries and parenthesized groups and, + * furthermore, the presence of a '-' operator at the beginning of + * a group can inhibit parsing of the previous operator. Hence, + * the user-provided query MUST appear first, but it is safe to + * parenthesize and the exclusion part of the query must not use + * the '-' operator (though the NOT operator is fine). */ + + char *escaped, *query_string; + const char *join = ""; + int i; + unsigned int max_tag_len = 0; + + /* Allocate a buffer for escaping tags. */ + for (i = 0; i < add_tags_count; i++) + if (strlen (argv[add_tags[i]] + 1) > max_tag_len) + max_tag_len = strlen (argv[add_tags[i]] + 1); + for (i = 0; i < remove_tags_count; i++) + if (strlen (argv[remove_tags[i]] + 1) > max_tag_len) + max_tag_len = strlen (argv[remove_tags[i]] + 1); + escaped = talloc_array(ctx, char, max_tag_len * 2 + 3); + + /* Build the new query string */ + if (strcmp (orig_query_string, "*") == 0) + query_string = talloc_strdup (ctx, "("); + else + query_string = talloc_asprintf (ctx, "( %s ) and (", orig_query_string); + + for (i = 0; i < add_tags_count; i++) { + query_string = talloc_asprintf_append_buffer ( + query_string, "%snot tag:%s", join, + _escape_tag (escaped, argv[add_tags[i]] + 1)); + join = " or "; + } + for (i = 0; i < remove_tags_count; i++) { + query_string = talloc_asprintf_append_buffer ( + query_string, "%stag:%s", join, + _escape_tag (escaped, argv[remove_tags[i]] + 1)); + join = " or "; + } + + query_string = talloc_strdup_append_buffer (query_string, ")"); + + talloc_free (escaped); + return query_string; +} + int notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) { @@ -93,6 +163,12 @@ notmuch_tag_command (void *ctx, unused (int argc), unused (char *argv[])) return 1; } + /* Optimize the query so it excludes messages that already have + * the specified set of tags. */ + query_string = _optimize_tag_query (ctx, query_string, argv, + add_tags, add_tags_count, + remove_tags, remove_tags_count); + config = notmuch_config_open (ctx, NULL, NULL); if (config == NULL) return 1;