Draft          FYI on an Internet Trouble Ticket Tracking System          Nov 1991

User Connectivity Problems Working Group                       M. Mathis
Internet Draft                                                       PSC
                                                                 D. Long
                                                                     BBN
                                                       November 11, 1991

           FYI on an Internet Trouble Ticket Tracking System
          for addressing Internet User Connectivity Problems

Status of this Memo

This Internet Draft FYI describes a possible approach to improving the usability of the Internet. It is being distributed to members of the Internet community in order to solicit their reactions to the proposals contained herein. The proposed paradigm is not intended as a standard for the Internet. Rather, it is hoped that a general consensus will emerge as to the appropriate solution to this problem, perhaps leading to the formulation and adoption of standards.

This memo does not specify a standard. Distribution of this memo is unlimited.

Author's Address

This paper introduces a concept by Matt Mathis and the members of the IETF User Connectivity Problems Working Group. Please send correspondence to ucp@nic.near.net. This list may be subscribed to by sending a request to ucp-request@nic.near.net. Archives are located in mail-archives/ucp on nic.near.net.

Security Considerations

This memo raises no security issues; however, further refinements of the proposed model will need to address security requirements.

Abstract

Users having trouble with the Internet are directed to contact their designated Network Service Center. The Network Service Center creates a Trouble Ticket, which is registered with the Ticket Tracking System. The ticket is an agreement to obtain closure with the user. Network Service Centers can fix problems, track the work of others, or transfer responsibility for the ticket to other Network Service Centers using a formal hand-off procedure. Ticket hand-offs are coordinated by the Ticket Tracking System, and ticket progress is monitored by the Ticket Support Centers.
User complaints with the problem resolution process may be lodged with a Ticket Support Center, which will act on behalf of the user in resolving the problem.

Preface

In this document formal rules and assertions are left justified, while commentary and conventions are indented. The formal rules state the requirements for tracking trouble reports. The commentary describes how we expect the rules to be invoked.

The scheme that we are describing here is somewhat like the game of bridge: the "rules" are very simple, but the "conventions" are self-adjusting and very complex. The commentary in this document is intended to seed conventions, but not to mandate their future.

Network Service Centers (NSCs)

Network Service Centers are the principal agents of problem resolution. They generate, hold and close tickets, perform diagnostics, make repairs, etc. NSCs are self-defined by agreeing to adhere to the rules described in this document.

The NSCs register their agreement to comply with these guidelines in order to prevent an NSC from accidentally transferring a ticket to a NOC which is not participating, and which in turn may fail to properly transfer or close the ticket. The only requirement for registration as an NSC is that the organization agrees to honor the rules for handling tickets.

In most cases NSCs are existing NOCs; however, other agents who are not customarily viewed as NOCs could be NSCs. Examples include the operator of any Internet resource, or the network software support group of a computer hardware or software vendor.

It is expected that almost all regional NOCs and some of the larger campus NOCs will become NSCs. NSCs which find themselves chronically acting on behalf of some non-registered NOC will encourage it to become registered. Thus there is a built-in pressure to help existing NOCs become registered.
The current list of NSCs must be available online in an ASCII database (an NSC phone book). Besides listing contact information, it should list responsibilities and areas of expertise.

By "responsibilities", we mean specific components of the Internet infrastructure which a particular NSC has direct responsibility for maintaining. Examples include nets, name servers, and ASes belonging to a particular organization with an NSC, and also, as separate items, "connectivity to" nets or ASes which are nominally transit.

By "areas of expertise", we mean areas where an NSC has significant technical background and may be able to provide general help diagnosing problems. For example, an NSC which runs a large name server may be willing to help other NSCs diagnose name server problems.

If the phone book covers a wider scope than this procedure (for example, a complete listing of contacts for all networks, domains and Autonomous Systems in the entire Internet), then each entry should be tagged to indicate whether the contact or NOC has agreed to honor these procedures.

Tickets

A Ticket is a commitment to obtain closure with a user. Tickets are created when a user reports a problem to an NSC. Tickets are closed when the user is informed of the resolution of the problem.

Only a registered NSC can hold a ticket.

An NSC must never refer one of its users to another NSC. An NSC may refer other users to another NSC, but must include a pointer to a Ticket Support Center as well.

THIS IS THE CENTRAL POINT: A Ticket is a commitment to obtain closure with a user. These tickets are not intended to track problems. They are to ensure that user complaints are never lost. This intent does not preclude their use for other purposes.

These tickets may not be suited to an organization's own internal practices. Most existing ticket systems track problems, not complaints.
The scheme described in this document does not address all of the issues involved in reliably tracking all problems. These potential shortcomings might be addressed in either of two ways: by implementing a system which is a superset of the functions described here, or by implementing an entirely different problem tracking system with cross-references to these tickets.

Tickets are not commitments to fix problems. An NSC can choose to hold a ticket, monitor someone else (e.g., another NOC) making a repair, and then contact the user itself to be sure that the user is satisfied. Or, it could transfer the ticket (see below) to the NSC responsible for the repair, which would then contact the user.

By "User", we mean anyone who is not a registered NSC. Users can be true end users, host administrators, campus net administrators, or unregistered NOCs.

If the person who reports a problem is a non-NSC site administrator acting on behalf of a true end user, it is desirable to get contact information for both the site administrator and the end user. If there is any evidence that the site administrator is not adequately passing information to the end user, the NSCs should contact the end user directly. There is also an opportunity to recruit the site administrator as an NSC.

Each NSC must publish a description of its user community in the NSC Phonebook. For example, a regional network's user community might be "official technical contacts at regional member sites and anyone outside the network having difficulty reaching the regional network". A backbone provider's user community might be "official technical contacts at one of the member networks served."

If someone who is not in an NSC's designated user community contacts that NSC, then the NSC must consult the NSC Phonebook and refer them to their appropriate NSC.
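The referral rule can be made concrete with a small phone book lookup. This is a minimal sketch under stated assumptions: the record layout (one `key: value` line per field, blank lines between entries), the field names `nsc`, `contact`, `community` and `honors`, the example addresses, and the modeling of a user community as a list of network names are all hypothetical. The draft only requires an online ASCII database with contact information, responsibilities, expertise, the user community, and a tag showing whether the contact honors these procedures.

```python
from dataclasses import dataclass, field

# Hypothetical ASCII phone book: one "key: value" per line, blank line between
# entries, multi-valued fields comma-separated.
PHONE_BOOK = """\
nsc: NEARnet-NSC
contact: nsc@example.net
community: campus-a, campus-b
honors: yes

nsc: Backbone-NSC
contact: noc@example.org
community: NEARnet, Regional-X
honors: yes

nsc: Campus-C-NOC
contact: ops@example.edu
community: campus-c
honors: no
"""

TSC_CONTACT = "tsc@example.com"  # hypothetical Ticket Support Center address


@dataclass
class Entry:
    nsc: str
    contact: str
    community: list[str] = field(default_factory=list)
    honors: bool = False  # has this contact agreed to the ticket procedures?


def parse_phone_book(text: str) -> list[Entry]:
    entries = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        entries.append(Entry(
            nsc=fields["nsc"],
            contact=fields["contact"],
            community=[c.strip() for c in fields.get("community", "").split(",") if c.strip()],
            honors=fields.get("honors", "no").lower() == "yes",
        ))
    return entries


def referral(entries: list[Entry], user_network: str) -> str:
    """Referral for a caller outside our user community.

    Only registered NSCs are offered, and the TSC contact is always included.
    """
    for e in entries:
        if e.honors and user_network in e.community:
            return (f"Please contact {e.nsc} <{e.contact}>; "
                    f"if in doubt, contact the TSC <{TSC_CONTACT}>.")
    # No registered NSC claims this user, so send them to the TSC alone.
    return f"Please contact the TSC <{TSC_CONTACT}> to find your NSC."
```

Note that a NOC which has not agreed to the procedures is never offered, which is exactly the case the registration tag exists to catch.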
The NSC must also tell the user how to contact the Ticket Support Center, which will help resolve any confusion about which NSC the user should be contacting.

This policy of mandatory redirection may expose well-known NSCs to a large number of calls. Practically speaking, it may become necessary to refer those users only to the Ticket Support Center, which can then keep a tally of which users are calling the wrong NSC. If the Ticket Support Center were to produce a monthly report of redirects, NSCs might be encouraged to improve their efforts at end-user education.

An NSC may choose not to open a ticket for some classes of "simple" reports in which closure with the user is obtained "immediately". However, such policies carry risks: for example, if the user later has complaints about the NSC's performance, the NSC will have no documentation for its defense. In general, it is best for NSCs to err on the side of entering tickets for insignificant events.

An NSC may act as a user and open a ticket either with itself or with another NSC.

Ticket Tracking System

There is a Ticket Tracking System (TTS) responsible for the mechanics of tracking tickets. The TTS should nominally be fully automated, being accessed by the various NSCs via the network. It should also support limited telephone queries from NSCs to confirm the status of particular tickets. The TTS is also responsible for archiving completed tickets.

The TTS must always have a recent copy of each ticket, a record of which NSC is holding it, and its various statuses.

For the initial implementation, the primary channel to the TTS should be SMTP mail. Queries and updates are posted as mail messages, with a ticket number and function as either the subject or the first line of the body. All functions which change tickets are implicitly appends: no portion of a ticket is ever deleted.
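The mail channel could start out as simply as the following sketch. It assumes a hypothetical command syntax of `<ticket-number> <function>` (for example `1234 append`), taken from the Subject if it has that shape and otherwise from the first line of the body, and it implements only two hypothetical functions, `append` and `query`. It uses Python's standard `email` package to parse the message. The point it illustrates is the append-only rule: every change adds an entry, and nothing is ever modified or removed.

```python
from email import message_from_string
from email.message import Message

# Hypothetical in-memory store: ticket number -> list of (sender, text) entries.
tickets: dict[str, list[tuple[str, str]]] = {}


def command_of(msg: Message) -> tuple[str, str, str]:
    """Return (ticket, function, text) from the Subject or, failing that,
    from the first line of the body."""
    body = msg.get_payload()
    subject = (msg.get("Subject") or "").split()
    if len(subject) == 2:
        return subject[0], subject[1].lower(), body
    first, _, rest = body.partition("\n")
    words = first.split()
    if len(words) != 2:
        raise ValueError("no '<ticket> <function>' in Subject or first line")
    return words[0], words[1].lower(), rest


def handle(raw: str) -> str:
    msg = message_from_string(raw)
    ticket, function, text = command_of(msg)
    if function == "append":
        # Changes are appends only: existing entries are never edited or deleted.
        entries = tickets.setdefault(ticket, [])
        entries.append((msg.get("From", "unknown"), text.strip()))
        return f"ticket {ticket} appended (entries: {len(entries)})"
    if function == "query":
        return "\n".join(f"{who}: {what}" for who, what in tickets.get(ticket, []))
    raise ValueError(f"unknown function {function!r}")
```

For example, a message with `Subject: 1234 append` adds its body to ticket 1234, and a later `Subject: 1234 query` returns the full history in order.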
At some point, the mail channel should be migrated to privacy-enhanced mail.

The TTS, in conjunction with SMTP mail, is really a substitute for a distributed database with a public interface. When a real distributed database becomes publicly available (meaning that it runs on enough platforms, at a low enough price, not to exclude any NSCs), we should be prepared to migrate to it. The detailed requirements for the TTS belong in a future RFC.

There is a formal mechanism for passing tickets between NSCs. This mechanism must be designed such that tickets cannot be lost.

A possible procedure to pass a ticket from R1 to R2 might be as follows:

1) R1 first inquires (out-of-band) whether R2 is willing to accept the ticket.

2) If R2 is unwilling, R1 must continue effort on the ticket, either to find a willing NSC or to repair the problem itself.

3) If R2 is willing to accept the ticket, R1 sends a message to the TTS with any final remarks, notifying the TTS of its intent to transfer the ticket to R2.

4) R2 sends a request to the TTS notifying it of its intent to accept the ticket from R1.

5) The TTS sends confirmation notices to both. In the notice to R2 it includes the entire current content of the official ticket.

6) R2 informs the user that the Ticket has been transferred and provides any updates. (This is required.)

7) R1 optionally contacts the user to reassure him that the problem is being worked on. This is particularly useful if the initial NSC contacted is not the "closest" NSC, in order to encourage a direct query to the closer NSC in the future.

The out-of-band inquiry in step 1 is the most important part of the process. By "out-of-band," we mean not prescribed by this process. Many problems may be resolved at this point without transferring the ticket. Often, R2 is either already working on the problem or can fix it on the spot.
In these cases, R1 should confirm the repair and contact the user to close the ticket.

It is entirely acceptable for R2 to suggest that some other NSC is more appropriate to deal with the problem. NSCs may refer NSCs to other NSCs.

There are some potential races and misbehaviors, particularly since SMTP delivery cannot be assumed to be 100% reliable. However, strong timer-based and out-of-band checks are possible: R1 asks R2, "Did you receive confirmation of the transfer?"; R2 asks the TTS, "Who holds the ticket?"; and the TTS contacts R1 and/or R2 if it receives only one of the messages in steps 3 and 4 within the prescribed time period. Timers can be associated with all of the above states such that the TTS can detect protocol botches.

The ticket transfer procedure must be described in complete detail in a future RFC.

Ticket Support Centers

There is a small set of Ticket Support Centers (TSCs) to deal with problem tickets. The TSCs are responsible for monitoring the quality of the NSCs and of the ticket handling procedures. There are three separate functions: expediting tickets which are not making adequate progress (as detected by the TTS timers), arbitrating between NSCs, and acting as a user ombudsman.

The problems covered here represent potential failures of the ticketing mechanism itself. They do not represent normal escalation of tickets within the system. See the comments below about ticket flow.

The three functions are really independent and could co-reside with some NSCs, such as the NSCs for the backbone networks.

Tickets which remain in any state for extended periods of time without transactions are likely to be broken. Some states, such as handoff between NSCs, are always transient, and any persistence in these states probably indicates a handoff failure. Other states, such as "we are working on it", should include progress reports, or else they are suspect.
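The TTS side of transfer steps 3 through 5, together with the timer check for a half-completed handoff, might look like the following sketch. The class and method names, the one-hour timeout and the in-memory state are hypothetical, since the draft leaves the detailed procedure to a future RFC. The property it illustrates is that the holder changes only when both the release (step 3) and the matching acceptance (step 4) have arrived, so a lost mail message leaves the ticket with R1 rather than with nobody.

```python
from dataclasses import dataclass, field
from typing import Optional

HANDOFF_TIMEOUT = 3600.0  # seconds; hypothetical "prescribed time period"


@dataclass
class Ticket:
    number: str
    holder: str                                   # NSC currently holding the ticket
    log: list[str] = field(default_factory=list)  # append-only history
    release: Optional[tuple] = None  # step 3 seen: (receiving NSC, time)
    accept: Optional[tuple] = None   # step 4 seen: (receiving NSC, releasing NSC, time)


class TTS:
    """Hypothetical in-memory TTS; times are passed in explicitly (seconds)."""

    def __init__(self, registered: set[str], timeout: float = HANDOFF_TIMEOUT):
        self.registered = registered
        self.timeout = timeout
        self.tickets: dict[str, Ticket] = {}
        self.notices: list[tuple[str, str]] = []  # (recipient, text) the TTS would mail

    def open(self, number: str, nsc: str, user: str) -> None:
        if nsc not in self.registered:
            raise ValueError("only a registered NSC can hold a ticket")
        self.tickets[number] = Ticket(number, nsc, [f"{nsc}: opened for {user}"])

    def release(self, number: str, r1: str, r2: str, remarks: str, now: float) -> None:
        """Step 3: the holder R1 announces its intent to transfer to R2."""
        t = self.tickets[number]
        if r1 != t.holder or r2 not in self.registered:
            raise ValueError("only the holder may release, and only to a registered NSC")
        t.log.append(f"{r1}: transfer to {r2}: {remarks}")
        t.release = (r2, now)
        self._complete_if_ready(t)

    def accept(self, number: str, r2: str, r1: str, now: float) -> None:
        """Step 4: R2 announces its intent to accept the ticket from R1."""
        t = self.tickets[number]
        t.log.append(f"{r2}: accept from {r1}")
        t.accept = (r2, r1, now)
        self._complete_if_ready(t)

    def _complete_if_ready(self, t: Ticket) -> None:
        # The holder changes only when steps 3 and 4 agree; until then R1 keeps it.
        if t.release and t.accept and t.release[0] == t.accept[0] and t.accept[1] == t.holder:
            r1, r2 = t.holder, t.release[0]
            t.holder = r2
            t.log.append(f"TTS: {t.number} transferred from {r1} to {r2}")
            t.release = t.accept = None
            # Step 5: confirm to both; R2 receives the entire current ticket.
            self.notices.append((r1, f"{t.number} transferred to {r2}"))
            self.notices.append((r2, "\n".join(t.log)))

    def check_timers(self, now: float) -> list[str]:
        """Flag tickets where only one of steps 3 and 4 arrived in time."""
        stalled = []
        for t in self.tickets.values():
            half = t.release or t.accept
            if half is not None and now - half[-1] > self.timeout:
                parties = {t.holder, t.release[0]} if t.release else {t.accept[0], t.accept[1]}
                for nsc in sorted(parties):
                    self.notices.append((nsc, f"{t.number}: handoff incomplete, please check"))
                stalled.append(t.number)
        return stalled
```

A TSC could be fed the list returned by `check_timers` as its queue of tickets to expedite; the same idea extends to timers on the other states.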
We want to be able to track chronic minor problems, so it may not be unusual for some tickets to stay open for a long time.

If an NSC realizes that some problem is beyond its abilities and cannot find another NSC to take the ticket, or if the other NSCs refuse the ticket, then a TSC can be invoked to find someone who will accept it. In all cases where a ticket is refused under controversy, statements from all parties should be included in the ticket. (See the meta dialogue, below.)

The ombudsman should have a widely published phone number for users who feel that they have not received appropriate service from an NSC. The ombudsman should not act as a first point of contact. Any comments by the ombudsman should be recorded in the ticket.

The details of the structure of the ticket flow between NSCs are outside the formal scope of this document. The NSCs are all peers in the sense that any NSC can transfer any ticket to any other NSC willing to accept it. The normal escalation of tickets happens implicitly as a result of the rules for transferring tickets between centers.

Problems with Internet connectivity will naturally follow the hierarchy of the Internet and the contractual agreements at interconnections between ASes. Problems with the Domain Name System will naturally follow its hierarchy. Interoperability problems will naturally flow horizontally between NSCs near the end systems, and perhaps to NSCs operated by the end systems' vendors.

We claim that the above ad hoc structure is already in place and functioning very well for the majority of (simple) problems. The formalism of ticketing is needed to prevent malformed or difficult problem reports from vanishing without being adequately addressed.

The phone book is critical to optimizing this process.
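Because the phone book records both responsibilities and areas of expertise, an NSC looking for someone to take a ticket could derive a list of candidates from it. In the sketch below, the search order (NSCs directly responsible for the affected component first, then NSCs claiming expertise in that kind of problem), the entry names and the resource and topic keys are all assumptions made for illustration. Whatever the lookup returns is only a candidate for the out-of-band inquiry in step 1; nothing moves until that NSC agrees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSC:
    name: str
    responsibilities: frozenset[str]  # nets, name servers, ASes, "connectivity to" items
    expertise: frozenset[str]         # general problem areas, e.g. "nameservers"


def candidates(book: list[NSC], resource: str, topic: str, exclude: str = "") -> list[str]:
    """NSC names to ask, responsible NSCs first, then experts.

    `exclude` is the asking NSC, which should not ask itself.
    """
    responsible = [n.name for n in book if resource in n.responsibilities]
    experts = [n.name for n in book if topic in n.expertise and n.name not in responsible]
    return [name for name in responsible + experts if name != exclude]
```

An empty result is the situation in which the text expects a TSC to be invoked to find someone who will accept the ticket.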
A problem which seems to involve some resource is most likely best dealt with by the NSC nearest that resource; failing that, that NSC is likely to have a clearer picture of which other NSC should be involved (since NSCs are allowed to redirect other NSCs). Since these flows are self-organizing, they will adapt to new services and to the evolution of the Internet.

The only externally applied pressure to influence the ticket flow is that users are encouraged and, if necessary, redirected to contact the "nearest" NSC. (See below.)

The body of a ticket is anticipated to be a chronological series of entries. These entries can be characterized as falling into one of four different streams of thought or conversation, called dialogues. A ticket has an abstract, which is excerpted from the dialogues. The Ticket is "append only". The only content of a ticket which is not part of a dialogue is the identifier of the ticket itself.

The dialogues are "user", "operations", "engineering" and "meta". We will not be proposing the details of ticket format or representation, except that a ticket has no content outside of the dialogues.

The abstract is a summary of the current state and history of a ticket. Data which might ordinarily be considered "header information" is included in the abstract. The abstract must be algorithmically derivable from the dialogues (i.e., it need not be stored separately, and cannot be edited directly). In the simplest case, an abstract entry is the most recent entry of a particular kind in a dialogue. For example, the NSC which holds the ticket will always be named by the last entry transferring the ticket. For efficiency reasons the abstract may be stored rather than recomputed, but then the entire history of the abstract will be visible in the Ticket.

There must be a detailed specification of the format of a ticket in a future document.
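Because the abstract must be derivable from the dialogues, it can be written as a pure function over the list of entries, as in this sketch. It assumes, hypothetically, that each entry carries a dialogue name, an author, a kind tag such as `open`, `transfer` or `user-contact`, and free text, with the text of a `transfer` entry naming the receiving NSC; the draft leaves the real format to a future document. The holding NSC falls out as the target of the most recent `transfer` entry, as in the example above.

```python
from typing import NamedTuple

DIALOGUES = ("user", "operations", "engineering", "meta")


class Entry(NamedTuple):
    dialogue: str  # one of DIALOGUES
    author: str    # NSC (or user) that wrote the entry
    kind: str      # hypothetical tag: "open", "transfer", "user-contact", "note", ...
    text: str      # for "transfer" entries, the name of the receiving NSC (assumption)


def abstract(ticket_id: str, entries: list[Entry]) -> dict:
    """Recompute the abstract from scratch; nothing here is stored or hand-edited."""
    latest: dict[str, Entry] = {}
    for e in entries:        # entries are in chronological order
        latest[e.kind] = e   # a later entry of the same kind supersedes an earlier one
    holder = latest["transfer"].text if "transfer" in latest else latest["open"].author
    return {
        "ticket": ticket_id,
        "holder": holder,
        "last_user_contact": latest["user-contact"].text if "user-contact" in latest else None,
        "entries_per_dialogue": {d: sum(e.dialogue == d for e in entries) for d in DIALOGUES},
    }
```

Storing the result would be purely a cache; the entry list remains the only authoritative content of the ticket.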
The simpler the constraints on the format, the easier it will be for many diverse NSCs to use, but the harder it will be to do meaningful post-processing (collecting statistics on failures, etc.). The format should be extensible, so that more structured detail can be added later. The initial format should be as simple as reasonably possible. Later on, as we gain a better understanding of what post-processing is needed, it should become more structured.

Each dialogue has specific participants, audience and objectives. One or more of the dialogues may be empty in a specific ticket. It is anticipated that some of these dialogues, notably Engineering and Meta, will be added after the ticket is closed with the user.

The user dialogue relates all conversations with the user who reported the problem, including the user's contact information, the initial description of the symptoms, records of subsequent reports of additional symptoms, and notifications of progress. The transaction closing the ticket will always acknowledge contact with the user.

Every single exchange with the user should be recorded. If multiple users call about the same problem, they can be carried as separate user dialogues within the same ticket. In that case, however, the NSC must reach closure with all of the users in order to close the ticket.

The operations dialogue relates the history of the technical effort to resolve a problem, including diagnostic results and interpretations, and repair. This dialogue most closely resembles traditional trouble ticketing systems, though in the larger context of this paper it is possible to achieve closure with a user while action items of an operational or other nature remain open. While an exhaustive description of this situation is outside the scope of this document, it is expected that this will typically result in the NSC generating a new ticket for the pending action item.
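The closure rule for a ticket with several callers can be expressed as a check over its user dialogues: the ticket may close only when every user dialogue ends in an acknowledged contact. In this sketch the per-user dialogue structure, the `closed_with_user` flag and the method names are hypothetical representations, not a format taken from the draft.

```python
from dataclasses import dataclass, field


@dataclass
class UserDialogue:
    user: str                                            # contact information for this caller
    exchanges: list[str] = field(default_factory=list)  # every exchange is recorded
    closed_with_user: bool = False                       # closure acknowledged with this user


@dataclass
class Ticket:
    number: str
    users: list[UserDialogue] = field(default_factory=list)
    closed: bool = False

    def record(self, user: str, text: str) -> None:
        """Append an exchange, starting a separate dialogue for a new caller."""
        for d in self.users:
            if d.user == user:
                d.exchanges.append(text)
                return
        self.users.append(UserDialogue(user, [text]))

    def acknowledge_closure(self, user: str, text: str) -> None:
        self.record(user, text)
        next(d for d in self.users if d.user == user).closed_with_user = True

    def close(self) -> None:
        pending = [d.user for d in self.users if not d.closed_with_user]
        if not self.users or pending:
            raise RuntimeError(f"closure not yet reached with: {pending or 'any user'}")
        self.closed = True
```

For example, if two users report the same outage and only one has been told of the repair, `close()` refuses and names the user still waiting.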
The engineering dialogue is a commentary on how the infrastructure could be improved to prevent future occurrences of this problem. We take the position that any failure which is detected by a user before being repaired by a NOC really contains two failures: the "operational" problem (what failed), and an "engineering" problem (why it affected the user). In an ideal world, all facilities would be sufficiently redundant and monitored that all failures are detected and repaired by a NOC without being noticed by the users. The engineering portion of the ticket is meant to help guide us to that goal.

Many problems are not operational and cannot be "repaired" in any conventional sense. Examples include problems caused by insufficient bandwidth or other resource limitations. These tickets enter a "purgatory" state where closure has been obtained with a user (say, by providing a workaround) but the real problem requires more than a "repair" and is still present. By definition these tickets can be closed, because there has been closure with the user, but it is imperative that the engineering commentary be escalated out of the ticket system and into the Internet engineering and planning organizations. See the discussion of closing statuses.

Since operations personnel are often ill-equipped to evaluate engineering issues, engineering dialogues may be entered by the engineering staff after the ticket has been closed with the user.

The meta dialogue is a commentary about the ticket process itself. What improvements should be made to the ticketing mechanism to make it more effective in the future? This would address questions such as "Were the rules followed?", "Are the ticket routing conventions appropriate?", "Is the phone book accurate and complete?", "Did a ticket handoff fail?", "Was a ticket completed in a reasonable time?", etc.
There are several issues about how the user interacted with the ticketing system which should also be addressed here:

If the user did not contact the nearest NSC, the NSC may be insufficiently educating its users.

If the user contacted more than one NSC, the first NSC may not be doing its job or may be violating the "Don't refer users" rule.

If the user contacted more than one NSC, the user may be unreasonably impatient.

The meta dialogue serves as a quality control check on the ticketing process and as a driver for the user education process.

The final closing status of a ticket can be used to request additional effort. We believe this to be the most difficult and most critical part of the entire process. Many problems reported as short-term operational problems are really long-term engineering and economic issues. These must be identified and corrected before they destabilize the entire Internet. The solution proposed here is incomplete and insufficient. We believe that the only tickets of real interest are precisely the ones which will tend to be closed with the user without true resolution.

The final status can be decomposed into four parts: an estimate of the user's satisfaction with the resolution, an operational action item for facilities outside of the scope of the ticketing system (e.g., upgrade an out-of-revision host), and action items taken directly from the engineering dialogue and from the meta dialogue in the ticket.

These final statuses should have formal representations such that statistics and action items can automatically be generated from closed tickets. However, at this time it is not at all clear how this should be done. This is so critical to the long-term success of the Internet that it should be solved in phases. A future RFC should spell out some clearly interim closing statuses, with explicit hooks for a future mechanism.
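One possible interim representation of the four-part final status is sketched below, together with the kind of mechanical report the text asks for. The field names, the three-level satisfaction scale, and the mapping of "purgatory" tickets to a partial rating are hypothetical choices for illustration, not a proposed standard.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Satisfaction(Enum):           # hypothetical interim scale
    SATISFIED = "satisfied"
    PARTIAL = "partial"             # assumed rating for workaround ("purgatory") closures
    DISSATISFIED = "dissatisfied"


@dataclass
class ClosingStatus:
    satisfaction: Satisfaction
    operational_item: str = ""                                 # outside the ticket system
    engineering_items: list[str] = field(default_factory=list)
    meta_items: list[str] = field(default_factory=list)


def report(closed: dict[str, ClosingStatus]) -> tuple[Counter, list[tuple[str, str, str]]]:
    """Tally satisfaction and list every residual action item with its ticket."""
    tally = Counter(s.satisfaction.value for s in closed.values())
    items = []
    for ticket, s in sorted(closed.items()):
        if s.operational_item:
            items.append((ticket, "operational", s.operational_item))
        items += [(ticket, "engineering", i) for i in s.engineering_items]
        items += [(ticket, "meta", i) for i in s.meta_items]
    return tally, items
```

Keeping each part in its own field is what lets engineering items be forwarded to planning organizations and meta items to the TSCs without anyone rereading the dialogues.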
Only after we have operational experience with real Internet-wide tickets will we really understand how closing statuses should be represented.

The user's satisfaction is the key result. If the user is dissatisfied, this may be an indication that the ticket was closed too early. An NSC may not always be an objective judge of a user's satisfaction. If users are constantly appealing their problems to TSCs and reporting markedly different assessments of their level of satisfaction, then the TSC should raise this issue with the NSC and perhaps with its funding agency.

An NSC is not required to take any actions after closing a ticket. No mechanism is provided for the evaluation of post-ticket actions taken by NSCs. However, we expect NSCs will be motivated to take action so as to prevent recurrence of similar tickets or problems.

Residual operational action items should be explained to the user so that the user can follow up with the responsible parties. This will often be the case when the user must take some action to solve the problem; for example, the owner of a broken host is the only party who can fix it. This must not, however, become a mechanism for prematurely closing tickets and leaving the user to track the problem.

Every NSC has different priorities and constraints. An action that one NSC might consider routine might be a major project for another. For example, the NSC for a national backbone network might be able to make a trivial configuration change to monitor another operational parameter of its network, whereas for a small corporate NSC, that kind of change might mean purchasing an entire network monitoring package.

Ticket dialogues are, by default, private communications. NSCs determine the extent to which tickets they hold are made public. Rights to dialogue contents and responsibilities for dialogue privacy are transferred among NSCs along with the tickets themselves. The TTS maintains copies of all open tickets.
It provides copies only to TSCs, the current NSC, and any NSC cited in the dialogues. Any NSC so cited has the right to append commentary.

Every NSC must establish a policy for itself about the extent to which it will make dialogues public. For example, NEARnet currently makes the operations dialogue available to the user and interested network members. It generally limits the engineering dialogue to its staff and technical advisory committee, and it limits the meta dialogue to its project staff. Note that NSCs receiving public funding are likely to be required to make reports, of whatever level of detail, to their sponsors.

The User/NSC/TTS/TSC structure proposed here assumes a degree of cooperation among components. This cooperation can arise from any of several sources, including, but not limited to, a service commitment to end users, a desire to enhance the usability of the Internet, or the contractual decree of some funding organization. We hope that the system will function well regardless of the reasons various organizations have for being a part of it. We expect that some issues will transcend the bounds of this system and be addressed out-of-band by network providers, funding organizations, end users, and market demands.