Sharing Data and Models in Software Engineering

by Tim Menzies, Ekrem Kocaguneli, Burak Turhan, Leandro Minku, Fayola Peters

Elsevier Reference Monographs, 2014. ISBN: 9780124173071, 415 pages


Chapter 1

Introduction


Before we begin: for the very impatient (or very busy) reader, we offer an executive summary in Section 1.3 and a statement on next directions in Chapter 25.

1.1 Why read this book?


NASA used to run a Metrics Data Program (MDP) to analyze data from software projects. In 2003, the research lead, Kenneth McGill, asked: “What can you learn from all that data?” McGill's challenge (and funding support) resulted in much work. The MDP is no more, but its data was the seed for the PROMISE repository (Figure 1.1). At the time of this writing (2014), that repository is the focal point for many researchers exploring data science and software engineering. The authors of this book are long-time members of the PROMISE community.

Figure 1.1 The PROMISE repository of SE data: http://openscience.us/repo.

When a team has been working at something for a decade, it is fitting to ask, “What do you know now that you did not know before?” In short, we think that sharing needs to be studied much more, so this book is about sharing ideas and how data mining can help that sharing. As we shall see:

 Sharing can be very useful and insightful.

 But sharing ideas is not a simple matter.

The bad news is that, usually, ideas are shared very badly. The good news is that, based on much recent research, it is now possible to offer much guidance on how to use data miners to share.

This book offers that guidance. Because it is drawn from our experiences (and we are all software engineers), its case studies all come from that field (e.g., data mining for software defect prediction or software effort estimation). That said, the methods of this book are very general and should be applicable to many other domains.

1.2 What do we mean by “sharing”?


To understand “sharing,” we start with a story. Suppose two managers of different projects meet for lunch. They discuss books, movies, the weather, and the latest political/sporting results. After all that, their conversation turns to a shared problem: how to better manage their projects.

Why are our managers talking? They might be friends and this is just a casual meeting. On the other hand, they might be meeting in order to gain the benefit of the other's experience. If so, then their discussions will try to share their experience. But what might they share?

1.2.1 Sharing insights


Perhaps they wish to share their insights about management. For example, our diners might have just read Fred Brooks's book The Mythical Man Month [59]. This book documents many aspects of software project management, including the famous Brooks' law, which says “adding staff to a late software project makes it later.”

To share such insights about management, our managers might share war stories on (e.g.) how upper management tried to save late projects by throwing more staff at them. Shaking their heads ruefully, they remind each other that often the real problems are the early lifecycle decisions that crippled the original concept.

1.2.2 Sharing models


Perhaps they are reading the software engineering literature and want to share models about software development. Now “models” can mean different things to different people. For example, to some object-oriented design people, a “model” is some elaborate class diagram. But models can be smaller, much more focused statements. For example, our lunch buddies might have read Barry Boehm's Software Economics book. That book documents a power law of software that states that larger software projects take exponentially longer to complete than smaller projects [34].

Accordingly, they might discuss if development effort for larger projects can be tamed with some well-designed information hiding.¹

(Just as an aside, by model we mean any succinct description of a domain that someone wants to pass to someone else. For this book, our models are mostly quantitative equations or decision trees. Other models may be more qualitative, such as the rules of thumb that one manager might want to offer to another; in the terminology of this chapter, though, we would call that more insight than model.)

1.2.3 Sharing data


Perhaps our managers know that general models often need tuning with local data. Hence, they might offer to share specific project data with each other. This data sharing is particularly useful if one team is using a technology that is new to them, but has long been used by the other. Also, such data sharing has become fashionable amongst data-driven decision makers such as Nate Silver [399] and within the evidence-based software engineering community [217].

1.2.4 Sharing analysis methods


Finally, if our managers are very experienced, they know that it is not enough just to share data in order to share ideas. This data has to be summarized into actionable statements, which is the task of the data scientist. When two such scientists meet for lunch, they might spend some time discussing the tricks they use for different kinds of data mining problems. That is, they might share analysis methods for turning data into models.

1.2.5 Types of sharing


In summary, when two smart people talk, there are four things they can share. They might want to:

 share models;

 share data;

 share insight;

 share analysis methods for turning data into models.

This book is about sharing data and sharing models. We do not discuss sharing insight because, to date, it is not clear what can be said on that point. As to sharing analysis methods, that is a very active area of current research; so much so that it would be premature to write a book on that topic. However, for some state-of-the-art results in sharing analysis methods, the reader is referred to two recent articles by Tom Zimmermann and his colleagues at Microsoft Research. They discuss the very wide range of questions that are asked of data scientists [27, 64] (and many of those queries are about exploring data before any conclusions are made).

1.2.6 Challenges with sharing


It turns out that sharing data and models is not a simple matter. To illustrate that point, we review the limitations of the models learned from the first generation of analytics in software engineering.

As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [443] speaking of his programming experiences from the early 1950s:

It was on one of my journeys between the EDSAC room and the punching equipment that hesitating at the angles of stairs the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.

It took several decades to gather the experience required to quantify a size/defect relationship. In 1971, Fumio Akiyama described the first known “size” law, saying that the number of defects D is a function of the number of lines of code; specifically,

D = 4.86 + 0.018 × LOC
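
To make that relationship concrete, here is a minimal sketch (not from the original text) that evaluates Akiyama's equation; the function name and the 10,000-line example are invented for illustration:

```python
def akiyama_defects(loc: int) -> float:
    """Akiyama's 1971 size law: D = 4.86 + 0.018 * LOC."""
    return 4.86 + 0.018 * loc

# Hypothetical example: a 10,000-line system.
print(akiyama_defects(10_000))  # -> 184.86 predicted defects
```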

Alas, nothing is as simple as that. Lessons come from experience and, as our experience grows, those lessons get refined/replaced. In 1976, McCabe [285] argued that the number of lines of code was less important than the complexity of that code. He proposed “cyclomatic complexity,” or v(g), as a measure of that complexity and offered the now (in)famous rule that a program is more likely to be defective if

v(g) > 10
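
As an illustration of applying that rule (a sketch, not from the original text; the module names and v(g) values are invented), one might flag defect-prone modules like so:

```python
# Hypothetical (module, v(g)) pairs; in practice v(g) would come
# from a static-analysis tool rather than a hard-coded list.
modules = [("parser", 14), ("logger", 3), ("scheduler", 23)]

# McCabe's rule: flag a module as likely defective if v(g) > 10.
flagged = [name for name, vg in modules if vg > 10]
print(flagged)  # -> ['parser', 'scheduler']
```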

At around the same time, other researchers were arguing that not only is programming an inherently buggy process, it is also inherently time-consuming. Based on data from 63 projects, Boehm [34] proposed in 1981 that linear increases in code size lead to exponential increases in development effort:

effort = a × KLOC^b × ∏_i EM_i(F_i)   (1.1)

Here, a and b are parameters that need tuning for particular projects, and each EM_i is an “effort multiplier” that controls the impact of some project factor F_i on the effort. For example, if F_i is “analyst capability” and it moves from “very low” to “very high,” then according to Boehm's 1981 model, EM_i moves from 1.46 to 0.71 (i.e., better analysts let you deliver more systems, sooner).
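
As a worked sketch of Equation (1.1) (not from the original text): the 1.46 and 0.71 multipliers are the Boehm values quoted above, but the a, b, and KLOC numbers below are purely illustrative assumptions, since a and b must be tuned to local projects:

```python
from math import prod

def cocomo_effort(kloc, a, b, multipliers):
    """Boehm-style estimate (Equation 1.1):
    effort = a * KLOC^b * product of the effort multipliers."""
    return a * kloc ** b * prod(multipliers)

# Illustrative values only; a and b need local calibration.
a, b = 2.94, 1.10
low  = cocomo_effort(100, a, b, [1.46])  # "very low" analyst capability
high = cocomo_effort(100, a, b, [0.71])  # "very high" analyst capability
print(round(low), round(high))  # -> 680 331: better analysts roughly halve the effort
```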

Forty years later, it is very clear that the above models are true only in certain narrow contexts. To see this, consider the variety of software built at the Microsoft campus, Redmond, USA. A bird flying over that campus would see a dozen five-story buildings. Each of those buildings has (say) five teams working on each floor. These 12 * 5 * 5 = 300 teams build...