Problem-solving in high performance computing: a situational awareness approach with Linux

Problem-Solving in High Performance Computing: A Situational Awareness Approach with Linux focuses on understanding giant computing grids as cohesive systems. Unlike other titles on general problem-solving or system administration, this book offers a cohesive approach to complex, layered environmen...

Bibliographic Details
Other Authors: Ljubuncic, Igor, author
Format: eBook
Language: English
Published: Waltham, MA : Morgan Kaufmann [2015]
Edition: 1st edition
Collection: Gale eBooks
View at Biblioteca Universitat Ramon Llull: https://discovery.url.edu/permalink/34CSUC_URL/1im36ta/alma991009629690206719
Table of Contents:
  • Identification of a problem; If a tree falls in a forest, and no one hears it fall; Step-by-step identification; Always use simple tools first; Too much knowledge leads to mistakes; Problem definition; Problem that happens now or that may be; Outage size and severity versus business imperative; Known versus unknown; Problem reproduction; Can you isolate the problem?; Sporadic problems need special treatment; Plan how to control the chaos; Letting go is the hardest thing; Cause and effect; Do not get hung up on symptoms; Chicken and egg: what came first?
  • Do not make environment changes until you understand the nature of the problem; If you make a change, make sure you know what the expected outcome is; Conclusions; References; Chapter 2 - The investigation begins; Isolating the problem; Move from production to test; Rerun the minimal set needed to get results; Ignore biased information; avoid assumptions; Comparison to a healthy system and known references; It is not a bug, it is a feature; Compare expected results to a healthy system; Performance and behavior references are a must; Linear versus nonlinear response to changes
  • One variable at a time; Problems with linear complexity; Nonlinear problems; Response may be delayed or masked; Y to X rather than X to Y; Component search; Conclusions; Chapter 3 - Basic investigation; Profile the system status; Environment monitors; Machine accessibility, responsiveness, and uptime; Local and remote login and management console; The monitor that cried wolf; Read the system messages and logs; Using ps and top; System logs; Process accounting; Examine pattern of command execution; Correlate to problem manifestation; Avoid quick conclusions; Statistics to your aid; Vmstat
  • Iostat; System activity report (SAR); Conclusions; References; Chapter 4 - A deeper look into the system; Working with /proc; Hierarchy; Per-process variables; Kernel data; Process space; Examine kernel tunables; Sys subsystem; Memory management; Filesystem management; Network management; SunRPC; Kernel; Sysctl; Conclusions; References; Chapter 5 - Getting geeky - tracing and debugging applications; Working with strace and ltrace; Strace; Options; What you need to know before using strace; Strace from the standpoint of a system administrator; Strace has friends; Basic usage; Test case 1
  • Test case 2