Lxcert 21.04.05 draft2
minutes of the LXCERT meeting 21.04.2005
Invited / Attendance
| Abbrev | Name | Affiliation | Present |
|---|---|---|---|
| | Alberto Aimar | LCG Application Area | missing |
| BB | Bruce M Barnett | ATLAS-online | |
| AB | Alastair Bland | AB/CO-admin | |
| | Eric Cano | CMS-online | missing |
| JC | Joel Closier | LHCb-offline | |
| NMN | Nicolas De Metz-Noblat | AB/CO-development | excused → AB |
| BG | Benigno Gobbo | non-LHC experiments | |
| JI | Jan Iven | IT-services catchall (chair) | |
| NN | Niko Neufeld | LHCb-online | |
| EO | Emil Obreshkov | ATLAS-offline | |
| JP | Jarek Polok | general Desktops (secretary) | |
| | Fons Rademakers | ALICE-offline | excused → KS |
| | Thorsten Kleinwort | IT PLUS/BATCH service | missing |
| KS | Klaus Schossmaier | ALICE-online | late |
| | Stephan Wynhoff | CMS-offline | missing |
Agenda
- plans for the next CERN Linux version
- Version proliferation & LCG collaboration (physics compiler)
- continuous SLC evolution & test releases
- post-mortem of SLC3, changes to the process and LXCERT membership
Summary
PROPOSAL TO BE DISCUSSED
Platform choice for LHC startup
None of the participants would like to see LHC start while still running SLC3 as the principal Linux version. A new version will have to be certified beforehand, and the hard deadline by which this version must be production-ready and fully certified is end of October 2006.
Red Hat Enterprise 4 (SL4, SLC4) is available (if not quite usable) now, but would be "ancient" at the time LHC starts (problems with newer hardware are to be expected). It comes with gcc-3.4.3 which is anyway the next compiler chosen by the LCG AF.
Red Hat Enterprise 5 (SL5, SLC5) would be the most attractive platform, given that it will be relatively fresh at that time and will still be using a 2.6 kernel (hopefully well stabilized by then), but it will only be available late (expected 2Q2006). Any delay in certification would therefore put the deadline at risk and could impact computing at LHC startup. This risk was judged to be too high to target SLC5 now as the only Linux OS platform for LHC startup. The final decision will have to be delayed, and an appropriate fallback solution needs to be put in place.
Therefore it has been decided to certify SLC4 formally, using a parallel certification: in a first phase, LXCERT will validate the usability of the OS environment and non-physics applications, and in parallel AF will port and certify experiment code with gcc-3.4.3 on SLC3. Afterwards, the physics code will have to be validated on SLC4. At least initially, no significant batch capacity on SLC4 will be required. An initial target date for this certification would be end of 2005. SLC4 would become the fallback solution for physics computing and as such needs to be 100% ready for operation by 2Q2006 at the latest.
After the compiler version for RHE5/SLC5 has been fixed and as soon as Fedora Core4 or RHE5-beta reaches a sufficient level of stability, a similar parallel exercise (AF: port+certify compiler either on SLC3 or SLC4, LXCERT: validate OS; merge in a second phase) will be done for SLC5, to see whether SLC5 could be used for LHC startup.
The decision between SLC4 and SLC5 needs to be taken at the beginning of October 2006. In case the SLC5 certification fails or is delayed, SLC4 will be used. In both cases, migration to the chosen version is targeted for 4Q2006, and SLC3 would be phased out (with a suitable delay) afterwards. In case SLC5 is chosen, the then-obsolete SLC4 will be phased out together with SLC3.
While this solution creates additional work (one of the certifications will have been done in vain), it is felt that this is the only way to avoid lock-in to an obsolete operating system while keeping the risk under control.
Issue1: plans for the next Linux version
JI: Position summary from the mails sent before the meeting:
- SLC4 will be made available in any case for tests, but priority unclear
- go to SLC5 directly, don't certify SLC4: LHCb/Marco, ALICE/Fons, ATLAS/David
- need SLC4 (and some libraries): ATLAS-online/Bruce, LHCb-online/Niko, ALICE-online/Klaus (agree with ATLAS)
JI: (assuming compiler/OS split is OK): library availability for -online becomes an intra-experiment issue, please coordinate internally and with library providers, via AF.
JI: what are the reasons for SLC4 in "production" for anybody (e.g. kernel-2.6 availability?), and when?
AB: NMN wants SLC4 in January 06, and would prefer to have LXPLUS with that since it then feels "fully supported" (can direct users to IT helpdesk etc). May not be a "hard" requirement.
NN: LHCb "slow controls" needs PVSS, and compatibility would need testing (e.g. <iostream.h> needs to be available).
[JI: explicit action with ETM required during SLC3 certification]
[NN: aside: several issues still outstanding, SIGCHLD in log, /tmp-mode 0777]
AB: from Frank Schmidt (beam simulation): don't care about OS, need compiler (+ AFS capacity).
BB: 2.6 kernel isn't only point for ATLAS-online, stability is (release milestone in September 2006), unlikely to be in time if we only start a certification in summer 2006.
JP: last experience → 6 months to certify software (deep chain). Please remember that compat-libs only work backwards for 1 release, so SLC3→SLC5 will (most likely) need a recompile.
BB: staying on SLC3 completely forces "revolutionary change" instead of "evolution"
NN: all the LHCb data acquisition is already running on 2.6 (on SLC3), only
own online code (including kernel drivers). Benefits from the "clean"
LHCb separation between online and offline
[AB: have 2.6 driver experience? interested..]
[mini diskless discussion → redirected to Linux4Controls/CNIC]
JI: proposal: split by OS and compiler, certify in parallel, merge, decide quickly before the September 2006 deadline?
BB/EO: summer 2006 is too late for any new certification to start. In general agree that new compiler (on old OS) and new OS should go in parallel, then combine OS+compiler and run another round of tests.
[NN: when will support for SLC3 be dropped?
JI: other way round, decide when new stuff is available+migration
period etc., then phase out. RHE3 lifetime is until 2009
JP: but hardware support may push us, RHE3 only gets new drivers in
2005, and we don't want to carry too many versions around.]
JP: could have SLC4 installable in "basic setup" within a
few weeks. Do we need this?
Announced changes: mostly 2.6 kernel, gcc-3.4.3, general package update.
[diverge: EM64T support: looking at it (also works on SLC3, some compatibility
trouble between Opteron/Intel (wrong kernel))]
[diverge: updating system on SLC4? apt (have coded some dependencies
against it)?
JP: don't know.. x86_64 already on yum, may go to yum by default on
SLC4 - this is an internal Linux.Support choice]
NN: could use the cert infrastructure even for SLC4 if non-production?
JI: Agree, either do "real" certification or none at all, certification does not automatically mean deployment.
JP: could start certification "early" on FedoraCore4, as an alpha-test for Red Hat 5,
but current rate of changes is horrible (FC4 is still in beta, will
cool down later). Useful for porting tests only (versus announced
changes), random runtime breaks could occur and would need to be
ignored.
AB: would be mostly interested in commercial applications + Java
JI: won't work now on FC4
JP: ORACLE will support RHE4+5, but significant lag for native
libraries, relies on compatibility environment until then.
[AB: assume SLC4: new control room (100 PCs about 60 Linux with
multiscreen+NVidia): will the closed-source driver model continue
to work?
JP/JI: yes]
back to basic question - SLC4 or not?
JI summary:
- need "backup" stable SLC4 in any case.
- rolled-out SLC4 makes migration to SLC5 easier (also for experiments)
JP: what is the latest release date (fully certified, ready for deployment etc) of SLC5 that would still get it into the LHC startup phase?
AB: 1st November 2006 (general agreement, also from ATLAS, this is a HARD limit for AB-CO and LHCb)
[ AB: other changes that could go into next release:
SingleSignOn for Windows, shared home dir on DFS?
(would be nice for whole AB, Fermi has done SSO?)
JP: AFS not technically required for desktops/laptops,
but depends on software environment. Password sync : see
Linux4Controls?
DFS: No 'supported' DFS client on Linux...
JI: cert=time for changes, please let us know, can discuss later.]
JI: explains situation to be discussed at HEPiX (May 09):
SL(C)3 now is available world-wide (good for sites + experiments),
would like to keep this situation => need sites + experiments to agree
on timescale for next release.
Experiments: Please contact
non-CERN sites and get requirements/deployment plans. Need to match
experiment and site plans.
AB: assumption of 'Laptop=Desktop=control machine' is nice (compile
once, run directly), please keep if possible.
JI: still current working assumption:
laptop=desktop=PLUS=BATCH=online/controls/servers.
but actual roll-out may be staggered.
Conclusion:
- Split compiler and OS certification. Can in principle mix & match, but system compiler is somewhat preferred for 3rd-party compatibility
- SLC5 would be nicer to have for LHC startup than SLC4, but carries a risk => SLC4 needs to be fully & formally certified in 3Q2006
- if RHE5 looks promising (already from FedoraCore4/5 and the RHE5-beta) and is on time, check whether it can get fully certified with whatever compiler by October 2006.
- October 2006: decide which OS version + compiler version to deploy for LHC startup.
- only "test" SLC4 batch capacity (probably) required until that decision
- sometime before end of 2005, SLC4 should be a fully-supported OS (available for network installs, updated, helpdesk etc)
Issue2: Version proliferation & LCG collaboration
(intro by JI) 'LHC AF decides on compiler for non-LHC experiments'-issue:
BG: bad.
LEP experiments have manpower problems (e.g. L3 has almost no more people
to port their code; the others are in a little bit better shape).
The lack of manpower for code porting is a general issue for non-LHCs.
So the common desideratum is to have "as few changes as possible".
On the offline side most issues are compiler-related and the main
request is to continue to have libraries available (in particular CERNLib)
COMPASS probably will be the only running experiment in 2006.
[JC: confirms, nothing to do for ALEPH on SLC3 since gcc-3.2.3 was
already present]
JI:
- no manpower for tests → no veto even if formal right to do so
- LHC will have huge overlap in terms of requirements (FORTRAN, libraries etc) → will work most of the time
- non-LHC will still be a "normal" client to Application Area/Compiler provider (just like AB), and can escalate via hierarchy if they feel treated unjustly
Conclusion: compiler/OS split is OK.
Issue2bis: continuous SLC evolution & test releases
(not discussed, shelved for now)
Issue3: post-mortem of SLC3, changes to the process and LXCERT membership?
(some membership changes already happened before the meeting)
Issue:light-weight certification vs formal exercise:
NN: informal tests could be nice?
JI: results are useless unless some outcome is recorded (=formal yes/no, binding),
certification infrastructure is still useful,
deployment & certification can be decoupled
Conclusion: keep current certification infrastructure for now.
Issue:SLC3 post-mortem
JI: (quick rundown, will send details per mail):
Remember: SLC3 was 12 months late!
[AB: but actually like the November release time]
Lost time with Red Hat negotiations (~6 months) and due to summer
(key people absent) and 'blocking' chain August-October:
CMS (and LHCb and ATLAS, but they didn't veto) → POOL → cxxabi bug.
(Trivial) lessons:
- any cross-site negotiations take long
- summer is bad for certifications (issue for SLC5!)
- need to break down dependency chains for quick certifications (e.g. certify "production" scripts independently from working jobs?)
AOB:
AB: would like effort from IT/ORACLE for SLC4: one defined ORACLE
client production version (compatible with system compiler),
need Pro-C. If client distributed via AFS,
please on replicated volumes..
JI: need exact requirements, will forward. Explained split Physics-DB
(tied in via AppArea) / non-Physics (same situation as before).
BB: should we open 32 or 64 discussion?
JI: forecast: all current 3 platforms will need to be supported on OS level,
may drop IA64 in 2-3 years,
decision on priorities/batch capacity is with AF/IT task force, after
performance evaluation
JP: Most likely will stay on 32bit (at least for desktops) for this year, and have additional delay for PC purchases anyway: need to redo market survey → nothing moves before FC in November(?)
Actions:
JI: send post-mortem summary
JI: send minutes
all: continue discussion via mail, get external site requirements
JI/JP: report from HEPiX
JP: (slowly) set up SLC4 for initial tests.