Artemis - a platform for searching and data manipulation

Tămaș Ionuț
Software Developer @ TORA Trading Services

A properly designed domain model embeds a lot of information that is expressive and readable to end users. For instance, the class Order has a Customer property, with the semantics "an order is made by a customer", and the Customer class has simple-typed properties such as Name, Age and Email, with equally easy-to-grasp semantics: a customer is named "John", is 30 years old and has the email "john@doe.com". Seen this way, a well-designed domain model comes with a lot of free knowledge that we can take advantage of when building our system's UX.

Most of the time, our operations on web applications can be generically reduced to just two activities: searching and processing items. Especially in back-office web systems, searching is traditionally implemented by navigating to a search page of a specific model and applying filters (via textboxes, dropdown lists, etc.) on a subset of properties to filter a grid of model entries.

However, this approach does not take advantage of the underlying semantics we got for free when we designed the domain model (which expresses our domain logic) together with our stakeholders, including end users.

Wouldn't it be nice if our search mechanism consisted of a single textbox able to answer queries such as Orders made by "John" with price > 200 and shipping city "Seattle", and to dynamically generate a grid of the filtered orders based on the type we are searching for? Could such queries be applied to any type in our domain model?

This way, we provide a faster search experience for our users and take some of the burden of building filtering mechanisms off developers.

Prologue: Problem definition and processing pipeline

Natural language processing (NLP) is the mechanism through which human language is automatically processed. NLP touches areas like text parsing, speech recognition, formal language theory, machine learning and human-computer interaction. Natural user interface (NUI) design focuses on building user interfaces that provide seamless human-machine interaction, with NLP being a key component in providing these experiences.

Building a human-level NLP framework is an AI-complete problem (read: impossible so far), so we focus instead on a framework that mimics NLP searching by restricting the user's input to the entity-property semantics of our domain model classes. We also want to provide meaningful suggestions that guide the user toward valid input, to generate expressions that can be applied to an EF data context or to in-memory collections, and to return a dynamically generated grid based on the searched collection type. Finally, we want developers to be able to extend the entity-property semantics easily by annotating properties with more expressive relationship semantics, enriching the search space for the end user.

Based on this, we sketch the processing pipeline shown in Figure 1:

Figure 1 - NLP Pipeline

Chapter 1: Bootstrapping phase

In the bootstrapping phase we transform our domain model into a search graph, which we later use when parsing the user input. We opted for an annotation-based model, where the developer annotates the properties of the domain classes as shown in Figure 2.

Figure 2

We have 3 types of attributes of interest for the search graph building phase, with an example in Figure 2:

SearchableAttribute: specifies that a class, or a property within a class, is searchable; by default, all properties of a searchable class are searchable. This is also where we specify the semantics of the domain model, through aliases
NonSearchableAttribute: specifies that a property is not searchable
ImplicitAttribute: marks a property that can be left out of the query: Orders made by customer with name "John" is equivalent to Orders made by "John", where the "made by" expression is an alias for the Customer property of the Order class and Name is the implicit property of Customer; this attribute helps us parse expressive queries without spelling out property names that can be deduced from context

[Searchable]
public partial class Order
{
    [NonSearchable]
    public int Id { get; set; }
    public string ShipCity { get; set; }
    public string ShipCountry { get; set; }
    public string ShipAddress { get; set; }
    public string Code { get; set; }
    public decimal TotalPrice { get; set; }

    [NonSearchable]
    public Nullable<int> ShipperId { get; set; }
    [NonSearchable]
    public Nullable<int> EmployeeId { get; set; }
    [NonSearchable]
    public Nullable<int> CustomerId { get; set; }

    [Searchable("managed by", "handled by", 
    "assigned to")]
    public virtual Employee Employee { get; set; }
    [Searchable("shipped by")]
    public virtual Shipper Shipper { get; set; }
    [Searchable("made by")]
    public virtual Customer Customer { get; set; }
}

By alias, we mean a string holding the natural-language expression of the semantic relationship between a type and its property. By default, all properties are searchable and the default semantics are "Entity with property name" for all non-boolean properties. Figure 2 shows an example of how we annotate our domain model, in this case directly on the EF-generated entities. After annotating the domain model, we construct a graph whose nodes are the searchable types, linked by edges that carry the relationship semantics (from the annotation aliases) and the property type. This is also where we decide whether a property is implicit or not.
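To make the graph-building step more concrete, here is a minimal sketch of how annotated types could be turned into alias-labelled edges via reflection. The attribute definitions, SearchEdge and SearchGraphBuilder are hypothetical shapes invented for this illustration (the article names the attributes but does not show their code), the sample classes are trimmed stand-ins for the entities in Figure 2, and ImplicitAttribute handling is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text;

// Hypothetical attribute shapes; the real Artemis definitions are not shown in the article.
[AttributeUsage(AttributeTargets.Class | AttributeTargets.Property)]
public sealed class SearchableAttribute : Attribute
{
    public string[] Aliases { get; private set; }
    public SearchableAttribute(params string[] aliases) { Aliases = aliases; }
}

[AttributeUsage(AttributeTargets.Property)]
public sealed class NonSearchableAttribute : Attribute { }

// One edge of the search graph: owner type --alias--> property.
public sealed class SearchEdge
{
    public Type Owner;
    public string Alias;
    public PropertyInfo Property;
}

public static class SearchGraphBuilder
{
    public static List<SearchEdge> Build(params Type[] types)
    {
        var edges = new List<SearchEdge>();
        // Only classes marked [Searchable] become nodes.
        foreach (Type type in types.Where(t => t.IsDefined(typeof(SearchableAttribute), false)))
        {
            foreach (PropertyInfo property in type.GetProperties())
            {
                if (property.IsDefined(typeof(NonSearchableAttribute), false)) continue;

                var searchable = (SearchableAttribute)Attribute.GetCustomAttribute(
                    property, typeof(SearchableAttribute));

                // Explicit aliases win; otherwise fall back to "with <property name>".
                string[] aliases = searchable != null && searchable.Aliases.Length > 0
                    ? searchable.Aliases
                    : new[] { "with " + SplitPascalCase(property.Name) };

                foreach (string alias in aliases)
                    edges.Add(new SearchEdge { Owner = type, Alias = alias, Property = property });
            }
        }
        return edges;
    }

    // "ShipCity" -> "ship city"
    private static string SplitPascalCase(string name)
    {
        var sb = new StringBuilder();
        foreach (char c in name)
        {
            if (char.IsUpper(c) && sb.Length > 0) sb.Append(' ');
            sb.Append(char.ToLowerInvariant(c));
        }
        return sb.ToString();
    }
}

// Trimmed sample entities, for illustration only.
[Searchable]
public class Customer
{
    public string Name { get; set; }
}

[Searchable]
public class Order
{
    [NonSearchable]
    public int Id { get; set; }
    public string ShipCity { get; set; }
    [Searchable("made by")]
    public Customer Customer { get; set; }
}
```

Calling SearchGraphBuilder.Build(typeof(Order), typeof(Customer)) yields an edge from Order to Customer labelled "made by", plus default "with ..." edges for the unannotated properties. A flat edge list is used here only for brevity; a real graph would also index edges by owner type.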

Boolean properties deserve special attention, since they come in different flavors. In the bootstrapping phase, we extract a default alias for each property based on its name. Let's see some examples of common boolean property names and their group patterns:

Figure 3

To achieve these transformations we used SimpleNLG, a Java natural language generation library licensed under MPL 1.1. It allows us to detect verb tenses and to generate sentences in the form (affirmative or negative, singular or plural) and tense that we need. We converted the library to a .NET assembly with IKVM, by issuing the following command:

ikvmc SimpleNLG.jar -out:SimpleNLG.dll -target:library
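The idea of deriving an affirmative and a negated alias from a boolean property name can be illustrated without SimpleNLG. The sketch below is not the article's approach: it replaces the library with a small hand-written prefix table, and the BooleanAliases class, the prefixes it recognises and the "that are / that have" phrasing are all assumptions made for the example.

```csharp
using System;
using System.Text;

public static class BooleanAliases
{
    // Returns { affirmative, negated } aliases for a boolean property name.
    public static string[] For(string propertyName)
    {
        string rest;
        if (TryStripPrefix(propertyName, "Is", out rest))
            return new[] { "that are " + rest, "that are not " + rest };
        if (TryStripPrefix(propertyName, "Has", out rest))
            return new[] { "that have " + rest, "that do not have " + rest };

        // No recognised prefix: treat the whole name as an adjective, e.g. "Active".
        rest = ToWords(propertyName);
        return new[] { "that are " + rest, "that are not " + rest };
    }

    // Matches only when the prefix is followed by a new PascalCase word,
    // so "IsPremium" matches "Is" but "Island" does not.
    private static bool TryStripPrefix(string name, string prefix, out string rest)
    {
        bool matches = name.Length > prefix.Length
            && name.StartsWith(prefix, StringComparison.Ordinal)
            && char.IsUpper(name[prefix.Length]);
        rest = matches ? ToWords(name.Substring(prefix.Length)) : null;
        return matches;
    }

    // "OpenOrders" -> "open orders"
    private static string ToWords(string pascal)
    {
        var sb = new StringBuilder();
        foreach (char c in pascal)
        {
            if (char.IsUpper(c) && sb.Length > 0) sb.Append(' ');
            sb.Append(char.ToLowerInvariant(c));
        }
        return sb.ToString();
    }
}
```

For example, BooleanAliases.For("IsPremium") produces "that are premium" and "that are not premium". A fixed table like this cannot handle verb tenses or plural agreement, which is where a generation library such as SimpleNLG comes in.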

Chapter 2: Parsing phase

The parsing phase is the meat of the framework: here we take the user input and process it against the search graph to build a data structure that will be used in the next processing steps.

Let's see some input examples that we want to parse and figure out the rules that we need to define for our parser:

Orders with total price 100
Orders with total price > 100 made by "John Doe"
Products supplied by vendor with name containing "John"
Orders made by customers that are premium with ship city "Seattle"

Based on these examples, we define the parsing rules shown in Figure 5:

Figure 4

Figure 5

Our initial design was straightforward: since we restrict the input to a predefined format like "Root entity {property-alias operator value} {and} {property-alias operator value} …", it seemed that we could define a grammar extensible enough for our purposes. Irony is a great tool for this kind of job, but defining an unambiguous grammar for our parser proved to be impossible.

We therefore implemented our own parser, based on the FSM shown in Figure 6. We tokenize the input by splitting it on whitespace. The parse-container data structure keeps a state, a stack of the currently identified types and a tree containing the types and their property queries. Figure 7 depicts an example, showing how the parse container transitions as we feed in the tokens.

Boolean expressions are trickier since the operator and comparand values are absent and we have to infer them from the relationship semantics. We do this by generating two kinds of aliases for each boolean property (the affirmative and the negated versions). In the parsing phase, after we identify the semantic version, we bypass the SimpleProperty and Operator states and go directly to the Comparand state.
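A heavily simplified sketch of such a state machine is shown below. It is not the parser from Figure 6: it handles a single entity type with a flat alias table, has no types stack, no query tree and no boolean aliases, and it matches the longest alias greedily. The state names SimpleProperty, Operator and Comparand come from the text above; the Root state, the operator set, the tokenizer that keeps quoted phrases together, and the QueryParser and PropertyQuery types are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public enum ParseState { Root, SimpleProperty, Operator, Comparand }

public sealed class PropertyQuery
{
    public string Property;
    public string Operator;
    public string Value;
}

public static class QueryParser
{
    private static readonly HashSet<string> Operators =
        new HashSet<string> { ">", "<", "=", ">=", "<=", "containing" };

    // Splits on whitespace but keeps a double-quoted phrase as one token.
    public static List<string> Tokenize(string input)
    {
        return Regex.Matches(input, "\"[^\"]*\"|\\S+")
                    .Cast<Match>().Select(m => m.Value).ToList();
    }

    // aliases maps an alias such as "with total price" to a property name.
    public static List<PropertyQuery> Parse(
        string input, IDictionary<string, string> aliases)
    {
        List<string> tokens = Tokenize(input);
        var queries = new List<PropertyQuery>();
        ParseState state = ParseState.Root;
        PropertyQuery current = null;
        int i = 0;

        while (i < tokens.Count)
        {
            switch (state)
            {
                case ParseState.Root:
                    i++;                       // root entity name, not validated here
                    state = ParseState.SimpleProperty;
                    break;

                case ParseState.SimpleProperty:
                    if (tokens[i] == "and") { i++; break; }   // optional connector
                    string property;
                    int consumed = MatchAlias(tokens, i, aliases, out property);
                    if (consumed == 0)
                        throw new FormatException("Unknown property near '" + tokens[i] + "'");
                    current = new PropertyQuery { Property = property };
                    i += consumed;
                    state = ParseState.Operator;
                    break;

                case ParseState.Operator:
                    // A missing operator defaults to equality.
                    current.Operator = Operators.Contains(tokens[i]) ? tokens[i++] : "=";
                    state = ParseState.Comparand;
                    break;

                case ParseState.Comparand:
                    current.Value = tokens[i++].Trim('"');
                    queries.Add(current);
                    state = ParseState.SimpleProperty;
                    break;
            }
        }

        if (state == ParseState.Operator || state == ParseState.Comparand)
            throw new FormatException("Incomplete query: a value is still expected");
        return queries;
    }

    // Returns how many tokens the longest matching alias spans, or 0 if none match.
    private static int MatchAlias(List<string> tokens, int start,
        IDictionary<string, string> aliases, out string property)
    {
        for (int n = tokens.Count - start; n >= 1; n--)
        {
            string candidate = string.Join(" ", tokens.Skip(start).Take(n));
            if (aliases.TryGetValue(candidate, out property)) return n;
        }
        property = null;
        return 0;
    }
}
```

With the aliases "with total price" and "with ship city" registered, the input Orders with total price > 100 and with ship city "Seattle" yields two property queries: TotalPrice > 100 and ShipCity = Seattle. The sketch throws on incomplete input; a parser that also drives suggestions would keep the partial state instead.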

Chapter 3: Data filtering phase

Following the parsing phase, we construct a generic expression for the root entity type based on the parsed input tree. We opted for a fluent approach to building this expression, shown below for the "greater than" query; the other filters are implemented in the same way:

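public ExpressionBuilder AndGreaterThan(
  string property, int value)
{
  // Member access for the (possibly nested) property
  Expression source = GetExpressionBody(property);

  // Convert the value to the property's type first:
  // Expression.Constant throws if the value does not match
  // the given type (e.g. an int for a decimal property)
  ConstantExpression targetValue = 
    Expression.Constant(
      Convert.ChangeType(value, source.Type), source.Type);

  BinaryExpression comparisonExpression = 
    Expression.GreaterThan(source, targetValue);

  // Chain the comparison onto the filters built so far
  _accumulator = Expression.AndAlso(
    _accumulator, comparisonExpression);

  return this;
}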
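The GetExpressionBody helper and the other filters are not shown in the article. As a rough idea of how they might look, the sketch below resolves a dotted property path into nested member accesses and adds a "contains" filter. FilterBuilder is a hypothetical stand-in for the article's ExpressionBuilder, its constructor replaces the Empty/WithType calls, and the sample entity classes exist only for illustration.

```csharp
using System;
using System.Linq.Expressions;

public sealed class FilterBuilder
{
    private readonly ParameterExpression _parameter;
    // Start from "true" so every filter can be chained with AndAlso.
    private Expression _accumulator = Expression.Constant(true);

    public FilterBuilder(Type entityType)
    {
        _parameter = Expression.Parameter(entityType, "x");
    }

    // Turns "Employee.Name" into the member access x.Employee.Name.
    private Expression GetExpressionBody(string propertyPath)
    {
        Expression body = _parameter;
        foreach (string name in propertyPath.Split('.'))
            body = Expression.Property(body, name);
        return body;
    }

    public FilterBuilder AndGreaterThan(string property, object value)
    {
        Expression source = GetExpressionBody(property);
        // Convert the value to the property's type (e.g. int -> decimal).
        object converted = Convert.ChangeType(value, source.Type);
        _accumulator = Expression.AndAlso(_accumulator,
            Expression.GreaterThan(source, Expression.Constant(converted, source.Type)));
        return this;
    }

    public FilterBuilder AndContains(string property, string value)
    {
        Expression source = GetExpressionBody(property);
        var contains = typeof(string).GetMethod("Contains", new[] { typeof(string) });
        _accumulator = Expression.AndAlso(_accumulator,
            Expression.Call(source, contains, Expression.Constant(value)));
        return this;
    }

    // Non-generic lambda: x => true && filter1 && filter2 ...
    public LambdaExpression GetExpression()
    {
        return Expression.Lambda(_accumulator, _parameter);
    }
}

// Trimmed sample entities, for illustration only.
public class Employee
{
    public string Name { get; set; }
}

public class Order
{
    public decimal TotalPrice { get; set; }
    public Employee Employee { get; set; }
}
```

Note that the nested access is not null-safe: when the compiled filter runs in memory, an order without an employee would throw on x.Employee.Name, so a fuller version would add null checks along the path.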

For the query Orders with price > 100 managed by "John", we use the builder as follows:

Expression expression = ExpressionBuilder
               .Empty
               .WithType(typeof(Order))
               .AndGreaterThan("TotalPrice", 100)
               .AndContains("Employee.Name", "John")
               .GetExpression();

where the GetExpression method returns a non-generic lambda expression of the specified type. With the expression filter built, we apply it to IEnumerable or IQueryable instances through extension methods such as:

static IEnumerable<T> WhereBy<T>
 (this IEnumerable<T> collection, Expression filter)

static IEnumerable<object> Where
 (this IQueryable queryable, Expression filter)

We can use these extension methods as follows:

// IEnumerable collection instance
IEnumerable<Order> orders = GetOrders();

// "WhereBy" extension
IEnumerable<object> memoryResult = 
  orders.WhereBy(filter);

// EF data context
DbContext context = new ArtemisContext();

// IQueryable instance
DbSet queryable = context.Set(typeof(Order));

// "Where" extension
IEnumerable<object> queryableResult = 
  queryable.Where(filter);
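The article gives only the signatures of these two extensions. One plausible way to implement them is sketched below: the in-memory version compiles the lambda into a delegate, while the IQueryable version wraps the filter in a call to Queryable.Where so that the query provider can translate it. The method bodies, the FilterExtensions class name and the sample Order class are assumptions, and the sketch is exercised here only against an in-memory queryable, not against EF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class FilterExtensions
{
    // In-memory: compile the lambda once and evaluate it per item.
    // Assumes the lambda's parameter type is exactly T.
    public static IEnumerable<T> WhereBy<T>(
        this IEnumerable<T> collection, Expression filter)
    {
        var predicate = (Func<T, bool>)((LambdaExpression)filter).Compile();
        return collection.Where(predicate);
    }

    // Queryable: build Queryable.Where<TElement>(source, filter) as an
    // expression and hand it to the provider instead of compiling it.
    public static IEnumerable<object> Where(
        this IQueryable queryable, Expression filter)
    {
        MethodCallExpression call = Expression.Call(
            typeof(Queryable), "Where",
            new[] { queryable.ElementType },
            queryable.Expression, Expression.Quote(filter));

        IQueryable filtered = queryable.Provider.CreateQuery(call);
        // Cast while enumerating, so the cast is not part of the translated query.
        return Enumerable.Cast<object>(filtered);
    }
}

// Trimmed sample entity, for illustration only.
public class Order
{
    public decimal TotalPrice { get; set; }
}
```

The main design point is that the IQueryable path never compiles the filter: it stays an expression tree, so filtering can happen in the data store rather than in memory, provided the provider supports the operations used in the filter.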

Chapter 4: Prediction phase

Based on the unreduced expression in the parse container, we suggest ways for the user to complete a valid query. Suggestions are generated from the current parse state and the tokens that have not been identified yet. Of particular interest are the property suggestions: we use the parsed types stack to give properties of types near the top of the stack higher priority than those near the bottom, since in most cases users apply filters to the last entity type. We also give precedence to properties that have not been used yet, since the user will most likely apply only one filter per property. Figure 7 shows the evolution of the parsed entities stack and the constructed query tree for the input: Orders with price > 100 managed by "John".
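The two ranking rules for property suggestions can be sketched as a simple sort. The SuggestionRanker class, its inputs and the choice to rank "unused" ahead of "stack position" are assumptions; the article does not say how the two criteria are combined.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SuggestionRanker
{
    // typesTopFirst: parsed entity types, most recently parsed first.
    // aliasesByType: property aliases available for each type.
    // usedAliases:   aliases that already appear in the query.
    // prefix:        what the user has typed for the current token(s).
    public static List<string> Rank(
        IList<string> typesTopFirst,
        IDictionary<string, string[]> aliasesByType,
        ISet<string> usedAliases,
        string prefix)
    {
        return typesTopFirst
            // Tag each alias with the stack depth of its owner type (0 = top).
            .SelectMany((type, depth) =>
                aliasesByType[type].Select(alias => new { alias, depth }))
            .Where(s => s.alias.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            // Unused properties first, then types closer to the top of the stack.
            .OrderBy(s => usedAliases.Contains(s.alias) ? 1 : 0)
            .ThenBy(s => s.depth)
            .Select(s => s.alias)
            .ToList();
    }
}
```

Because OrderBy and ThenBy are stable, aliases with the same rank keep the order in which they were declared for their type.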

Epilogue: Wrapping up

Natural language processing is a very complex area of computer science. In this article, we walked through the steps of building a fairly robust NLP apparatus restricted to a given domain model, together with an extensible annotation facility, a prediction mechanism for guiding the user toward valid input and a set of utilities for applying the expression filter to data collections. Based on our end-user tests, we achieved, after training, a time reduction of up to 30% in searching operations on complex, data-heavy web administration applications.