Archive for the ‘engineering’ Category

Knocking MapReduce

Friday, January 18th, 2008

MapReduce is the power application for grid computing.  Grid computing works very well if the problem is “embarrassingly parallel,” but it seems to stop there.  As the book In Search of Clusters points out, you can’t expect to create a single system image or single image memory without maxing out your network pipes.  As long as the problem doesn’t require you to share state, grid computing works.  MapReduce is a function that works very well for grid computing and, like most “new” things, it seems to require that folks point out the death of the “old,” in this case, the relational database. I personally think that databases and MapReduce both solve real problems and are therefore valid approaches.
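To make the “no shared state” point concrete, here is a minimal single-process sketch of the two phases in Java. It is only an illustration of the idea, not Google’s implementation: the class name WordCountSketch and the word-count task are assumptions chosen for the example, and a parallel stream stands in for the grid.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class: imitates the map and reduce phases in one JVM.
final class WordCountSketch {
    private WordCountSketch() { }

    // Map phase: each line is handled on its own and emits (word, 1) pairs.
    // Nothing is shared between calls, so lines can go to any worker.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the emitted counts for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: the parallel stream plays the role of the grid for the map phase.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.parallelStream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());
        return reduce(pairs);
    }
}
```

The map calls never look at each other’s data, which is exactly why the work spreads across machines so easily. The moment a step needs to see state produced by another step, that property is gone.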

I think the following post, MapReduce: A major step backward, helps put things in perspective.  DeWitt and Stonebraker do a good job with analysis, but occasionally step over the line of objectivity with key comments like “…[g]iven the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale.”  I think Google’s implementation of MapReduce fits a solution to a problem very, very well and scales accordingly.  However, I agree that MapReduce applied generally is not a solution that scales.

The new way forward

Tuesday, November 13th, 2007

There is a new pattern emerging for scaling web-based systems: Cache Farms and Read Pooling.

For Java SE 7, JCache is being proposed to provide a cache abstraction, although it has been criticized for being out of date. Read Pooling is typically handled at the JDBC driver level or with a JDBC proxy.

There is still a good deal of engineering to be done with regards to these new technologies. I for one would like to see annotations that describe how an entity should be cached, something similar to ETags. At least on the Java runtime, various layers could manage caching much more precisely as requested by the domain object. For example, letting JPA’s cache know what to evict after how long so that clients that can tolerate stale data can coexist.
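As a rough sketch of what such an annotation could look like: the @CachePolicy annotation, its attributes, and the CachePolicyReader helper below are all hypothetical. None of them is part of JCache, JPA, or any existing library; they only illustrate a domain object declaring its own tolerance for stale data.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation, loosely modeled on HTTP cache headers.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface CachePolicy {
    // How long, in seconds, a cached copy may be served.
    int maxAgeSeconds();

    // When true, a caching layer must always go back to the source.
    boolean mustRevalidate() default false;
}

// An entity whose clients can tolerate five minutes of staleness.
@CachePolicy(maxAgeSeconds = 300)
class ProductCatalogEntry { }

// An entity that should never be served from cache.
@CachePolicy(maxAgeSeconds = 0, mustRevalidate = true)
class AccountBalance { }

// Hypothetical helper that a caching layer might consult.
final class CachePolicyReader {
    private CachePolicyReader() { }

    // Returns true if a copy cached ageSeconds ago may still be served.
    static boolean isFresh(Class<?> entityType, long ageSeconds) {
        CachePolicy policy = entityType.getAnnotation(CachePolicy.class);
        if (policy == null || policy.mustRevalidate()) {
            return false; // no declared policy, or always revalidate
        }
        return ageSeconds < policy.maxAgeSeconds();
    }
}
```

Any layer (an HTTP tier, a JPA second-level cache, a cache farm client) could read the same declaration, so the eviction rule lives with the domain object instead of in per-layer configuration.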

Schema First and Anemic Domain Models

Thursday, November 8th, 2007

I was recently introduced to the notion of Anemic Objects while discussing schema-first versus POJO-first XML definition. I come from the schema-first camp and considered the resulting Java objects to be just a normal side effect of this approach. The alternative has scary ramifications from a system design/interoperability perspective.

Before I get into my thoughts on either approach, let’s cover what an anemic object is. According to Martin Fowler, they’re an anti-pattern.

The fundamental horror of this anti-pattern is that it’s so contrary to the basic idea of object-oriented design; which is to combine data and process together.

The service, or business, objects are responsible for extracting the required bits and performing everything from validation to the actual business logic.

So is schema-first really a cause for this problem? Casually speaking, yes, but I believe it’s mostly due to the ease of code generation and a focus on ease of development. For most developers, learning XML Schema is not a simple undertaking, just like learning any language or standard. If I know Java, and a tool will create the required artifacts for me, then I can ignore what’s generated as long as it keeps in sync with my domain model.

Keeping in sync with a domain model can be challenging when a developer doesn’t control the underlying schema. Think about the experience of mapping a robust database schema to an object model. The relational/object impedance mismatch has resulted in numerous frameworks (Hibernate, JDO, JPA, EJB, TopLink and many more) to handle these differences. Even then, all of them allow a way to just use plain SQL and do the mapping yourself. The same mismatch applies to XML Schema. There are various frameworks, JAXB and XMLBeans for example, that help out in this area as well.

It’s easier to allow the Domain Model to drive these models or to just consider any schema-first output as a second class object or bit bucket. I’ll admit to subscribing to the latter. As an enterprise architect, it’s easier to think in terms of data schema and service endpoints that act on it. WS-* and REST both encourage this type of behavior. For the developer on the ground, this can be limiting as it does not help promote OO design.

Indeed often these models come with design rules that say that you are not to put any domain logic in the domain objects.

But it doesn’t have to be that way.
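To show the difference, here is a small sketch in Java. The classes (AnemicAccount, AccountService, Account) and the withdrawal rule are invented for illustration and do not come from Fowler’s article; the first pair is the anemic shape that binding tools tend to produce, the last class is the same rule kept with the data it protects.

```java
import java.math.BigDecimal;

// Anemic style: the object is a bag of data, typically what code gen emits.
class AnemicAccount {
    BigDecimal balance = BigDecimal.ZERO;
}

// All validation and business logic sits in a separate service class.
class AccountService {
    void withdraw(AnemicAccount account, BigDecimal amount) {
        if (amount.signum() <= 0 || account.balance.compareTo(amount) < 0) {
            throw new IllegalArgumentException("invalid withdrawal");
        }
        account.balance = account.balance.subtract(amount);
    }
}

// Rich style: data and process together; the object guards its own rules.
class Account {
    private BigDecimal balance;

    Account(BigDecimal openingBalance) {
        if (openingBalance.signum() < 0) {
            throw new IllegalArgumentException("opening balance cannot be negative");
        }
        this.balance = openingBalance;
    }

    void withdraw(BigDecimal amount) {
        if (amount.signum() <= 0 || balance.compareTo(amount) < 0) {
            throw new IllegalArgumentException("invalid withdrawal");
        }
        balance = balance.subtract(amount);
    }

    BigDecimal balance() {
        return balance;
    }
}
```

With the anemic pair, nothing stops other code from setting balance directly and skipping the rule. With the rich version there is only one way to change the balance, and the rule travels with it.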

I’ll pick on the automated build process for a moment. I think what happens is that the auto-gen of either Schema or POJO gets baked into the build process and developers forget about it. The output is a secondary thought that is constantly sync’d with their domain. The problem is that the external parties that rely on those artifacts are also treated as second class. A developer changes an external interface contract without even thinking. This is the part that is scary.

The middle ground is to use the schema gen once and lock it down. Put it in the repository and protect it with the walls of governance. The auto-gen can still take place, but it also needs to verify that it has not broken the contract. This way, a schema that started from a POJO and was given official approval as the external interface is protected from unwarranted change, and the developer is granted the ability to grow the POJO into the rich domain model they want.
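One way the “verify the contract” step could be wired into a build is sketched below. The SchemaContractCheck class and its method names are hypothetical, and the check is deliberately crude: it treats any textual change to the schema as a break, rather than testing for backward compatibility.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical build-time check: fail if the regenerated schema differs
// from the approved copy kept in the repository.
final class SchemaContractCheck {
    private SchemaContractCheck() { }

    // SHA-256 fingerprint of the schema text, ignoring line-ending style
    // and leading/trailing whitespace so cosmetic differences do not trip it.
    static String fingerprint(String schemaText) {
        String normalized = schemaText.replace("\r\n", "\n").trim();
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Throws if the generated schema no longer matches the approved one.
    static void verify(String approvedSchema, String generatedSchema) {
        if (!fingerprint(approvedSchema).equals(fingerprint(generatedSchema))) {
            throw new IllegalStateException(
                    "Generated schema no longer matches the approved contract");
        }
    }
}
```

A build would read the approved schema from the repository and the freshly generated one from the output directory, then call verify. A failure forces a conversation with whoever owns the contract before the change ships.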

This approach places constraints on the evolution of a domain object, so it’s recommended that some time be taken to get the domain object as close to the use case/user story as possible. Any evolution will be quickly detected and will signal integration or legacy concerns immediately.

Anemic models can be addressed with a slight change in the build process and a stronger embrace of the binding tools that typically generate them.

MultithreadedTC

Saturday, July 21st, 2007

Overview of MultithreadedTC - Dept. of Computer Science, UMD

I just came across this test harness for multithreaded applications.  What a great tool!  Thanks UMD.

Gears of Data Synchronization

Monday, June 4th, 2007

Google Gears API Developer’s Guide (Beta) - Architecture

Resolving these differences so that the two stores are the same is called “synchronization”. There are many approaches to synchronization and none are perfect for all situations. The solution you ultimately choose will likely be highly customized to your particular application.

[emphasis mine]

Data Synchronization is perhaps the next technological “platform” in every developer’s tool chest. I’ve been diving deeper into this topic to gain a better understanding of what the marketplace looks like and what standard solutions are available off the shelf. Unfortunately, it doesn’t look good from a practitioner’s perspective.

I don’t think the idea of a system of record is going away, but pulling a record, making changes and updating it concurrently with other edits to the SOR is becoming more and more common. “Current trends suggest that this pattern will matter more, not less, over time. In a service-oriented world, systems of record will recede into the background.” A lot of custom code is currently handwritten to manage reconciliation or synchronization for “detached” datasets. WS-* and its document-centric interfaces are bringing this integration need to the foreground, albeit slowly.

Think of it as a general-purpose data synchronization framework where commits and rollbacks are replaced with merge and resolve conflict.
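A minimal sketch of what “merge and resolve conflict” can mean for a single detached record follows. It is a field-level three-way merge; the ThreeWayMerge class, the map-of-strings record shape, and the “keep the SOR value until someone resolves it” policy are all assumptions made for the example, not features of Google Gears or any particular product.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical field-level three-way merge for one detached record.
final class ThreeWayMerge {
    private ThreeWayMerge() { }

    static final class Result {
        final Map<String, String> merged = new LinkedHashMap<>();
        final List<String> conflicts = new ArrayList<>();
    }

    // base:   the record as it was when pulled from the system of record
    // local:  the detached copy with the client's edits
    // remote: the record as it now stands in the system of record
    static Result merge(Map<String, String> base,
                        Map<String, String> local,
                        Map<String, String> remote) {
        Result result = new Result();
        TreeSet<String> fields = new TreeSet<>();
        fields.addAll(base.keySet());
        fields.addAll(local.keySet());
        fields.addAll(remote.keySet());

        for (String field : fields) {
            String b = base.get(field);
            String l = local.get(field);
            String r = remote.get(field);
            String value;
            if (Objects.equals(l, r)) {
                value = l;                    // both sides agree
            } else if (Objects.equals(b, l)) {
                value = r;                    // only the SOR changed
            } else if (Objects.equals(b, r)) {
                value = l;                    // only the client changed
            } else {
                result.conflicts.add(field);  // both changed, differently
                value = r;                    // assumed policy: keep SOR value for now
            }
            if (value != null) {
                result.merged.put(field, value);
            }
        }
        return result;
    }
}
```

The interesting part is the last branch. A transaction would simply roll one side back; a synchronization framework has to hand the conflict to something (a rule, a user, a later process) that can resolve it.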

Business Activity Monitoring…the new portal or just a new name

Thursday, March 1st, 2007

I’ve run across the term Business Activity Monitoring recently and wondered whether the term means the same thing to everyone or is another phrase, like SOA, meant to confuse and bewilder.  Business Activity Monitoring is commonly referred to as BAM.

BAM presented itself in context of a BPEL engine.  The goal is to take a BPEL process and monitor it in a dashboard.   Other than the term BPEL, isn’t this what executive dashboards have been striving for since the portal was first produced?  This begs the question, why would someone buy a BAM product when they probably have four portal vendors already in their software arsenal?

Before I answer the question, it might be important to look at some key differentiators.  First, there are events.  The vast majority of portals out there report data.  In the last couple of years, portals have added user interaction that stitches together uniformly via WSRP, but it’s not a panacea.  So what makes events so significant in a portal?  Firstly, they must be pushed to the client.  Until the advent of AJAX, pushing content to the browser was not common practice.  Now the line between request/response blurs to the point that entire productivity suites can be deployed on the web.  The ability to push events to the web makes Business Activity Monitoring something that is ready for the masses.

Another differentiator is how the data is collected and processed.  BAM relies on an “acquire, correlate and notify loop” that prompts a business user to action as events are streamed to the end user in real-time.  Contrast that to the data warehouse model where data is batch processed into the final report and published at regular intervals.  Data is not streamed in real-time.
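The loop can be sketched in a few lines of Java. Everything here is assumed for illustration (the ActivityMonitor class, the Event shape, the “N slow steps per process” rule); a real BAM product would have far richer correlation rules and would push the notification to a dashboard rather than collect it in a list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an acquire, correlate and notify loop.
final class ActivityMonitor {

    static final class Event {
        final String processId;
        final String step;
        final long millis;

        Event(String processId, String step, long millis) {
            this.processId = processId;
            this.step = step;
            this.millis = millis;
        }
    }

    private final long slowThresholdMillis;
    private final int alertAfter;
    private final Map<String, Integer> slowStepsByProcess = new HashMap<>();
    private final List<String> notifications = new ArrayList<>();

    ActivityMonitor(long slowThresholdMillis, int alertAfter) {
        this.slowThresholdMillis = slowThresholdMillis;
        this.alertAfter = alertAfter;
    }

    // Called once per event as it arrives (acquire).
    void onEvent(Event event) {
        if (event.millis < slowThresholdMillis) {
            return; // fast steps are not interesting here
        }
        // Correlate: count slow steps that belong to the same business process.
        int count = slowStepsByProcess.merge(event.processId, 1, Integer::sum);
        // Notify: raise one alert when the process reaches the threshold.
        if (count == alertAfter) {
            notifications.add("Process " + event.processId + " has " + count
                    + " slow steps; latest: " + event.step);
        }
    }

    List<String> notifications() {
        return notifications;
    }
}
```

The contrast with the warehouse model is that the decision to alert is made as each event arrives, not when a nightly batch finishes.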

So what should one look for in a BAM solution? I would suggest the ability to capture the data in real-time without killing the systems under observation. A BAM solution also needs the ability to link performance metrics from the underlying subsystems to provide context to a business process in trouble. Just knowing that the intake of a transaction is slow does not necessarily give you enough information to solve the problem.  The crucial link between business and IT must still be maintained.   IT will need the ability to see the business context and drill down to the technical context that is causing the adverse impact.

Is BAM the new portal?  Probably not.  It represents a new type of user interaction that emphasizes reporting in real-time.  Ideally, a BAM module will fit into an existing portal system so business users can use the information to make better informed decisions.

Excel is not a functional language: Got it!

Sunday, February 11th, 2007

I recently received some attention to a piece I wrote a while back about closures. Basically, folks reacted to my linkage between functional languages and Excel. Being that this site bills itself as a loudspeaker for an Enterprise Architect, I received some particular criticism around my “right” to make such a link. While I don’t claim to be an expert in functional languages, I do not believe I’m alone in my correlation. Some very smart people have also asked the same question: Is a spreadsheet a functional language, or at least a platform for a functional language?

  • http://research.microsoft.com/~simonpj/Papers/excel/index.htm
  • http://lists.canonical.org/pipermail/kragen-tol/2002-May/000713.html
  • http://citeseer.ist.psu.edu/lisper02haxcel.html

So perhaps Excel is not a functional language outright, but it’s awfully close. The folks at the Haskell Cafe even considered it in this thread. Bjorn Lisper’s comment sums it up nicely:

Yes, every cell in isolation contains an expression possibly with free variables, and so can be seen as a function in those variables. But these variables are not unbound since they are defined elsewhere in the spreadsheet. Thus, the sheet is rather a system of equations defining values, not functions. I think 0:th order is a good term :-)

Making a spreadsheet more intelligent is not beyond the realms of research or practices. The following applications attempt to bridge the gap.

  • http://siag.nu/siag/
  • http://www.mrtc.mdh.se/projects/Haxcel/

They’re definitely not going to eat into Excel’s revenue stream, but they do provide a higher order functional programming language on top of the spreadsheet metaphor.

Finally, I think Bulat Ziganshin sums it up nicely.

> Heard that statement recently — that Excel is a functional
> programming language, and the most used one — of any programming
> languages — on Earth! Is it true?

that’s true and breaks any words that FP thinking is “unnatural” for peoples :)

what matters here is that in Excel you *don’t* define calculation order which makes it a functional rather than imperative approach to computations
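A toy version of that idea, written in Java since that is the language used elsewhere on this site: cells are defined as formulas over other cells, in any order, and values are worked out only when asked for. The Sheet class is hypothetical and makes no claim about how Excel actually evaluates a workbook; it also assumes there are no circular references.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: a sheet as a system of equations, evaluated on demand.
final class Sheet {
    private final Map<String, ToDoubleFunction<Sheet>> formulas = new HashMap<>();
    private final Map<String, Double> cache = new HashMap<>();

    // Define (or redefine) a cell. The order of definitions does not matter.
    void define(String cell, ToDoubleFunction<Sheet> formula) {
        formulas.put(cell, formula);
        cache.clear(); // any change invalidates previously computed values
    }

    // Compute a cell's value by following its references to other cells.
    // Assumes the sheet has no circular references.
    double value(String cell) {
        Double cached = cache.get(cell);
        if (cached != null) {
            return cached;
        }
        ToDoubleFunction<Sheet> formula = formulas.get(cell);
        if (formula == null) {
            throw new IllegalArgumentException("Undefined cell: " + cell);
        }
        double result = formula.applyAsDouble(this);
        cache.put(cell, result);
        return result;
    }
}
```

For example, C1 can be defined as the sum of A1 and B1 before either A1 or B1 exists. The caller never says what to compute first; the dependencies decide.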

If, as an enterprise architect, I needed to make the case for functional languages, I can point to Excel. Business people know how powerful Excel is to their bottom line. Technical folks could ignore the semantics of such a correlation and focus on the end result: using a platform that will deliver the needs of the business quickly and accurately.

Transaction Odyssey

Sunday, January 14th, 2007

REST vs. SOA (i.e. WS-*)…the debate rages on.  Let me take a moment to step out of the debate over which discipline is better for doing web services and direct your attention to the 600-lb gorilla in the room: transactions.

Personally, I had predicted that WS-* would ultimately come up with a standard for distributed transactions via web services.  And why not?  WS-Transaction has already been spec’d by BEA, IBM and Microsoft.  Also, a quick search across Google didn’t reveal any discussions on transactions and REST, probably because transactions over HTTP have been around since the inception of eCommerce.  However, I’m comparing apples to oranges.  REST vs SOA is really REST vs SOAP, WSDL and UDDI.  WS-Transaction could be implemented using the REST discipline; however, I’m basing this on my opinion that REST is a subset of SOAP, WSDL and UDDI.

Transactions haven’t been covered in great detail in the REST world and only theoretically in the WS-* world (see WS-TX above).  If a transaction is needed on a resource, it must fall behind the interface.  SOA, SOAP, and REST proponents all seem to consider a transaction to be something handled behind the curtain.  If two resources need to coordinate, e.g. two databases, the X/Open XA two-phase commit (2PC) protocol comes into play.  The client doesn’t have access to what happens behind the interface and diligently waits for a positive or negative outcome.  Of course, someone much smarter than myself has already done the heavy thinking on this.

The always resourceful Stu Charlton pointed me to a paper written by Pat Helland entitled Life beyond Distributed Transactions: an Apostate’s Opinion, which gives a good summary of building ridiculously scalable transactional systems.  As it turns out, 2PC doesn’t cut it.  It’s either too slow or too complex.  Pat does provide some useful patterns that he believes will make an application scale.  It basically makes the case for REST and WS-*.

The two fundamental concepts that Life beyond Distributed Transactions drives home are entities and activities; you must build your applications around them.  First, entities are uniquely named.  Second, the entity is transactional so that each operation either completes or fails atomically.  This aligns nicely with what REST is touting as the one true way.  Documents are posted to URIs and operations act as if a resource or entity resides behind each URI.

The next concept is that each entity has activities that relate to predefined interactions with other entities.  If I’m understanding this correctly, activities are predefined, and therefore the entities that can interact are limited.  I think this construct aligns with WSDL and SOAP.   WSDL defines ports and bindings to the functions.  If a REST-like interface is provided on a SOAP endpoint, then WSDL is the language to describe those activities.

So it’s REST and SOA for massively scalable systems?  I haven’t decided yet, but I have to thank Stu for keeping me on my toes.

Does Governance equate to automated Event Correlation?

Tuesday, January 9th, 2007

A Java profiler that I think offers the best feature/price ratio has pre-announced its new product: JINSPIRED JXInsight - JXInsight Govern. I’m always curious about governance products because the premise is that they’ll somehow make decisions for you.

Just to level set, governance is the process for presenting solutions and deciding which one should be implemented. As a quick tangent, organization specifies who gets to make the decision. I don’t believe these can be implemented in computers; otherwise we’ll have a situation like that of Paul Proteus in Vonnegut’s novel Player Piano.

Getting back to JXInsight Govern, it is a product that does event correlation, plain and simple. It’s similar to quite a few other products like EMC Smarts and Jeff Jonas’s NORA, now IBM DB2 Identity Resolution. Esper is an open source library that provides an engine, and I’m sure there are a few more (please feel free to leave a comment linking to your product). These are all tools that do event correlation, and yet they don’t govern, so JXInsight Govern is a nice title, but the product doesn’t necessarily do what the name claims. So why allude to governance? What do event correlation and governance have to do with one another?

Data Analysis

Granted, it’s pre-canned data analysis based on a generalized case, but it saves a system analyst from having to dig into the data to find trends that are worth reporting. In the hands of an expert system analyst, it provides a baseline analysis to check against. However, it doesn’t render a decision and really can’t be considered a replacement for governance.

Event correlation does not equate to governance, but it does provide additional information so the best decision can be made.

dev2dev Forums

Wednesday, January 3rd, 2007

On the WebLogic 10 Tech Preview dev2dev Forums, an engineer is struggling with JNDI in an EJB3 application. The solution is to use a descriptor file, but the engineer provides the following knee-jerk reaction:

…with the descriptor file it works properly, but I would like to avoid using descriptors, one of biggest goal of the EJB3.

This attitude is particularly troublesome since it represents a developer who is not looking to solve the problem, but instead looking for an aesthetic solution. Perhaps this is where the programming as art world meets those who wish to accomplish the job.

I typically find that most developers trip over JNDI and will build around their lack of understanding. From my perspective, Spring’s popularity came from developers who wanted an easier way and didn’t want to be bothered with reading the J2EE specification. Ultimately, it ended up with a similar mechanism. There is still a descriptor file and some creative ways of managing dependencies, but ultimately it’s a similar pattern for managing external environmental references.