Network Monitoring

Network monitoring “wizards” like Matt Mathis, Terry Gray, and Claudia de Luna have long recognized the need for network performance monitoring. A better set of tools for diagnosing transient network and system performance problems would help alleviate the most common networking problems and improve overall performance.

Network performance monitoring and measurement provide two key inputs for the exploitation of Grid technologies and e-Science:

  1. They provide knowledge of network metrics for use by the Grid resource broker services and the Grid middleware.
  2. They describe network performance from the perspective of a Grid application, which can be used to identify strategic issues that may arise and require action, e.g. bottlenecks, points of unreliability, and Quality of Service needs.

The concepts and practice of network monitoring are well understood and widely used to identify problems, to quantify performance, and to compare delivered performance against expected levels of service. Network monitoring already addresses all levels of network operation, from basic connectivity to application throughput. It can, for example, provide a broad-brush sense of performance or look in detail at the behaviour and performance of TCP itself.

Monitoring for the Grid is different in intent and purpose. Grid monitoring deals with end-to-end performance. It is closely coupled with real Grid applications and may allow those applications to vary their transport strategies for optimal performance by, for example, tuning TCP parameters, running multiple TCP streams, or making use of QoS provision in the network. To achieve this, the products of monitoring, the network metrics, are published to the Grid middleware. The same data can also be made available to the end user and to network service providers.

To that end, work has been funded to design and deploy a monitoring infrastructure within the e-Science community. This paper describes the vision of that architecture; a summary of the infrastructure, the tools and methods to be used; and the detail of progress made towards its completion. As a part of the presentation, a live demonstration of capability will be provided.

A Network Monitoring Architecture for the Grid

The objective of the architectural design is a simple and easily extensible framework within which a variety of monitoring tools may be deployed. This architecture (Figure 1) permits the publication of network metrics to the Grid middleware and makes them available, via visualisation, to the human observer.

Figure 1: Network monitoring architecture

Grid applications access the network metrics via an LDAP service according to a defined LDAP schema. The LDAP service gathers and maintains the metric data via scripts that either fetch the current measurements from the local network monitoring data store or have them pushed from it. Independently, a set of monitoring tools runs to collect monitoring data. This data describes, from the local perspective, the view of network access to other sites in the virtual organisation. In addition, scripts are associated with each monitoring tool to provide Web based access for viewing and analysis of the raw data. This architecture allows additional monitoring tools to be added easily. The only requirements are a means of analysing and visualising the data, a means of transporting the data to the LDAP service, and a means of including the products within the schema itself.
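
The publication step described above can be sketched roughly as follows. This is a minimal illustration only: the `Measurement` record, the `to_ldif` and `publish` helpers, the DN layout, the `networkMetric` object class and all attribute names are hypothetical placeholders, not the schema or scripts used by the project. The sketch simply renders the current measurements as LDIF-style text that a separate script could load into a directory service.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One current measurement taken from the local monitoring data store."""
    source: str        # local site making the measurement
    destination: str   # remote site in the virtual organisation
    tool: str          # monitoring tool that produced the value
    metric: str        # metric name, e.g. "rtt_ms" or "loss_percent"
    value: float


def to_ldif(m: Measurement, base_dn: str = "o=grid") -> str:
    """Render one measurement as an LDIF-style entry.

    The DN layout, object class and attribute names are hypothetical;
    a real deployment would follow its own defined LDAP schema.
    """
    dn = (
        f"metric={m.metric},tool={m.tool},"
        f"dest={m.destination},src={m.source},{base_dn}"
    )
    lines = [
        f"dn: {dn}",
        "objectClass: networkMetric",
        f"src: {m.source}",
        f"dest: {m.destination}",
        f"tool: {m.tool}",
        f"metric: {m.metric}",
        f"value: {m.value}",
    ]
    return "\n".join(lines)


def publish(measurements, base_dn: str = "o=grid") -> str:
    """Join all entries into one LDIF-style document, blank line between entries."""
    return "\n\n".join(to_ldif(m, base_dn) for m in measurements)


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    sample = [
        Measurement("site-a", "site-b", "ping-tool", "rtt_ms", 12.5),
        Measurement("site-a", "site-b", "ping-tool", "loss_percent", 0.0),
    ]
    print(publish(sample))
```

Keeping the rendering separate from the transport means a new monitoring tool only has to produce `Measurement`-like records; whether the entries are then fetched or pushed is left to the surrounding scripts.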

Metrics, Measurements and Tools

The monitoring described here is concerned with an understanding of round trip time (RTT) and packet loss from each center to all other likely collaborators. Measuring to every likely collaborator is a necessity if the products of monitoring are to be used in Grid resource brokerage. Further work in this area will be required, and the EDG are already exploring the use of derived metrics such as the notion of “closeness”.
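
As a rough illustration of how these two metrics are derived from raw probes, the sketch below summarises one batch of ping results into RTT statistics and a packet loss percentage. The function name `summarise_pings` and the input convention (`None` for a probe that received no reply) are assumptions made for this example, not part of any of the tools named in the text.

```python
from statistics import mean
from typing import Optional, Sequence


def summarise_pings(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Summarise one batch of ping probes.

    Each element is the round trip time in milliseconds for one probe,
    or None when no reply arrived for that probe.
    """
    sent = len(rtts_ms)
    if sent == 0:
        raise ValueError("no probes were sent")

    replies = [r for r in rtts_ms if r is not None]
    lost = sent - len(replies)

    return {
        "sent": sent,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent,
        # RTT statistics are undefined when every probe was lost.
        "rtt_min_ms": min(replies) if replies else None,
        "rtt_avg_ms": mean(replies) if replies else None,
        "rtt_max_ms": max(replies) if replies else None,
    }


if __name__ == "__main__":
    # Illustrative values only: four probes, one of which was lost.
    print(summarise_pings([10.0, 20.0, None, 30.0]))
```

Running such a summary at regular intervals for each remote site gives the time series of RTT and loss that the graphs in the next section display.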

In the initial deployment, simple Web based graphics will be used to display the output of monitoring, but developments to provide better integration are already planned.

Typical Output of Network Monitoring

The following figures show typical output from the set of monitoring tools being deployed.

Figure 2: Sample output from IEPM monitoring between SLAC and Daresbury

Figure 2 shows the output of the IEPM monitoring between SLAC and Daresbury for the last 28 days, including iperf data (green), bbftp data (pink) and ping data (red) among other metrics.

Figure 3 shows packet loss (blue) and RTT output (red) for PingER between the Oxford and Daresbury e-Science centres for the 14 day period to 12 July.

Figure 3: Sample output from PingER monitoring between Oxford and Daresbury e-Science Centers

Figure 4 shows UDP throughput measured over a 12 day period by UDPmon between Manchester and UCL, showing throughput (red) and packet loss (blue).

Figure 4: Sample output from UDPmon monitoring between Manchester and UCL

Network Monitoring Gap

There is a gap between the theoretical potential of a network and the speed at which it actually delivers data to a given application. The larger the gap, the more poorly the network is performing. Measuring this performance gap helps network engineers track how speed and reliability vary over time and across applications. Discussing end-to-end measurement places the focus on the end-user’s experience, which is, after all, what we want to improve. Capturing such data and using it to improve performance is relevant for Alpha University because of the many data-intensive research projects taking place on campus. Campus researchers are concerned with the overall performance and end result of particular applications, and application performance relies on network performance. The performance of the network is in turn the concern of the various network operators along the path of the connection, including those on campus. The two groups have separate areas of technical focus and expertise. The tools and initiatives developed in support of end-to-end performance monitoring must bridge these groups and support their individual goals.
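
One simple way to put a number on this gap, offered here only as a sketch, is to compare the throughput an application actually measures with the nominal capacity of the path. The function name `performance_gap` and the choice of megabits per second as the unit are assumptions made for this example; the text does not prescribe a particular formula.

```python
def performance_gap(capacity_mbps: float, measured_mbps: float) -> dict:
    """Compare measured application throughput with nominal path capacity.

    capacity_mbps: theoretical capacity of the path, in Mbit/s.
    measured_mbps: throughput actually delivered to the application, in Mbit/s.
    """
    if capacity_mbps <= 0:
        raise ValueError("capacity must be positive")
    if measured_mbps < 0:
        raise ValueError("measured throughput cannot be negative")

    gap = capacity_mbps - measured_mbps
    return {
        "gap_mbps": gap,                              # absolute shortfall
        "utilisation": measured_mbps / capacity_mbps,  # fraction of capacity delivered
        "gap_fraction": gap / capacity_mbps,           # fraction of capacity not delivered
    }


if __name__ == "__main__":
    # Illustrative figures only: a nominal 1000 Mbit/s path delivering 250 Mbit/s.
    print(performance_gap(1000.0, 250.0))
```

Recording this figure per application and per time interval gives the view over time and across applications that the paragraph above describes.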

Due to the nature of the Internet, the identification and resolution of end-to-end performance problems can be very difficult. As data bounces along routers between various domains (operated by different organizations and administered by different individuals), there is little the end-user can do to isolate a problem. However, with performance measurement tools an end-user can help to identify where a problem might be and inform the appropriate network administrator of the difficulty. Armed with specific data from the user, the network administrator is more inclined, and better equipped, to track down and resolve the problem.

Network Monitoring Systems

It is unfortunate that we have waited so long to implement network performance monitoring systems. Had the original architects of the Internet envisioned the many ways in which it is used today to transfer vast amounts of information, such systems would most likely have been built into its original architecture. Engineers once believed that excess bandwidth would solve all performance problems and that network monitoring would not figure in quality service provision. This has not proven to be the case, and the deployment of these services should not be delayed any further.

Due to years of poor or inconsistent quality of service from campus networks, the expectations of users have fallen. Yet their reliance on, and need for, network performance has risen as applications have become more complex and data-driven. Problems are patched in emergency situations and little attention is paid to underlying causes. Network administrators often allow the technology to operate below its full capacity because of the difficulty of locating and resolving performance problems. The overall result is a poor reputation for what could be a state-of-the-art network with a strong information support team behind it.

Leveraging the experience of “wizards” already employed by the institution, together with lessons already learned, can inform troubleshooting to a much greater extent. Measurement can also help with capacity planning, so that growth can be anticipated and similar issues avoided in the future. In particular, performance measurement can address the following institutional goals:

  • Quality control within network
  • Quality control across peering agreements
  • End-user satisfaction
  • Improved reputation of research initiatives on campus

Campus participation is crucial to the effectiveness of end-to-end performance monitoring because most of the work, and most of the workers, are on campus. Participating in network measurement activities will help researchers resolve end-to-end problems more quickly. Researchers, educators, and professionals on participating campuses and in labs will benefit from the efforts to increase performance.

Network Monitoring Development

This article focuses on the importance of network performance measurement and network monitoring in developing and growing high-speed networks on and between campuses and the role of measurement services in strategic planning.

Ever-increasing network speeds and the construction of applications to exploit them are driving network administrators to meet the needs of a much more demanding research community.  Researchers on campus and in labs have increasing expectations of network performance as applications become more sensitive and dependent on consistent, high-quality network performance.

CIOs and the networks they administer are poised to benefit from the implementation of extensive measurement tools. A number of drivers, both inside and outside our institutions, are moving our campuses toward implementing this infrastructure. They include:

  • Increasing requirements for interdisciplinary and inter-institutional research and collaboration.  Academic collaboration requires appropriate sharing of data and resources among institutions.  Scientific communities must be able to communicate quickly and reliably over long distances.  Massive amounts of data are sent between sites on a regular basis.  Video conferencing and other distance learning tools would also benefit greatly from monitored network performance.
  • Changing needs of researchers.  As mentioned above, researchers today rely on applications that push technology to the limit.  Extremely high-speed networks are a necessity to support the applications in use today.  Basic network infrastructure must not be responsible for holding back research.
  • Escalating expectations for 24-7 access to and use of optimally performing technology.  On the one hand, many users of high-speed applications have become used to below-optimal performance.  On the other hand, as new applications are developed and new objectives are set, there is a need to proactively change user expectations and improve the image of the network.
  • Increasing budgetary pressures.  Networks must perform well to be cost-effective.  Why rewire the network or install optical fiber if the end-to-end performance does not reflect that investment?  Make sure that your investments provide all the return they can and earn you the recognition you deserve.
